Calhoun


Institutional Archive of the Naval Postgraduate School 





Calhoun: The NPS Institutional Archive 


2010-09 


The circular pipeline achieving higher 
throughput in the search for bent functions 


Johnson, Christopher D. 


Monterey, California. Naval Postgraduate School 
http://hdl.handle.net/10945/5204 


This publication is a work of the U.S. Government as defined in Title 17, United 
States Code, Section 101. Copyright protection is not available for this work in the 
United States. 


Downloaded from NPS Archive: Calhoun 


Calhoun is the Naval Postgraduate School's public access digital repository for
research materials and institutional publications created by the NPS community.


Dudley Knox Library / Naval Postgraduate School 
411 Dyer Road / 1 University Circle 


http://www.nps.edu/library Monterey, California USA 93943 





NAVAL 
POSTGRADUATE 
SCHOOL 


MONTEREY, CALIFORNIA 


THESIS 


THE CIRCULAR PIPELINE: 
ACHIEVING HIGHER THROUGHPUT IN THE SEARCH 
FOR BENT FUNCTIONS 


by 


Christopher D. Johnson 


September 2010 


Thesis Co-Advisors: Jon T. Butler 
Pantelimon Stanica 





Approved for public release; distribution is unlimited 


THIS PAGE INTENTIONALLY LEFT BLANK 


Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instruction, 
searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send 
comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to 
Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 
22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188) Washington DC 20503. 


1. AGENCY USE ONLY (Leave blank) 2. REPORT DATE 3. REPORT TYPE AND DATES COVERED 
September 2010 Master’s Thesis 


4. TITLE AND SUBTITLE 5. FUNDING NUMBERS 
The Circular Pipeline: Achieving Higher Throughput in the Search for Bent 

Functions 

6. AUTHOR(S) Christopher D. Johnson 



7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) 8. PERFORMING ORGANIZATION 
Naval Postgraduate School REPORT NUMBER 
Monterey, CA 93943-5000 


9. SPONSORING /MONITORING AGENCY NAME(S) AND ADDRESS(ES) 10. SPONSORING/MONITORING 
N/A AGENCY REPORT NUMBER 


11. SUPPLEMENTARY NOTES The views expressed in this thesis are those of the author and do not reflect the official policy 
or position of the Department of Defense or the U.S. Government. IRB Protocol number: N/A. 


12a. DISTRIBUTION / AVAILABILITY STATEMENT 
Approved for public release; distribution is unlimited A 

13. ABSTRACT (maximum 200 words) 

For the first time, the circular pipeline as a means to significantly improve the throughput achieved in the search for 
bent functions is presented in this thesis. Linear cryptanalysis attack is a threat to modern symmetric encryption 
systems. A good defense is the use of a primitive based on Boolean functions having the highest nonlinearity 
possible—a bent function. Bent functions are extremely rare and, therefore, difficult to find. The implementation of 
a sieve on a field programmable gate array (FPGA) provides a high throughput (one function per clock) approach to 
searching for bent functions. With a clock frequency of 100 MHz, throughput is 100,000,000 functions per second. 
The circular pipeline as a way to achieve an even higher throughput is examined in this thesis. The theoretical 
maximum speedup is $2^n$, where n is the number of variables. The exact achievable speedup has been unknown until 
now. It is shown that a speedup of 55 is achieved at n = 6 with the design proposed in this thesis, which is 86% of the 
theoretical maximum. 





14. SUBJECT TERMS Circular Pipeline, Boolean Bent Functions, Hardware Complexity, Circuit 15. NUMBER OF 
Complexity, Nonlinearity, Hamming Distance, Cryptography PAGES 
116 


16. PRICE CODE 


17. SECURITY 18. SECURITY 19. SECURITY 20. LIMITATION OF 
CLASSIFICATION OF CLASSIFICATION OF THIS CLASSIFICATION OF ABSTRACT 
REPORT PAGE ABSTRACT 

Unclassified Unclassified Unclassified UU 


NSN 7540-01-280-5500 Standard Form 298 (Rev. 2-89) 







Approved for public release; distribution is unlimited 


THE CIRCULAR PIPELINE: 
ACHIEVING HIGHER THROUGHPUT IN THE SEARCH FOR BENT 
FUNCTIONS 


Christopher D. Johnson 
Lieutenant, United States Navy 
B.S., University of Michigan, 2003 


Submitted in partial fulfillment of the 
requirements for the degree of 


MASTER OF SCIENCE IN ELECTRICAL ENGINEERING 


from the 


NAVAL POSTGRADUATE SCHOOL 
September 2010 


Author: Christopher D. Johnson 


Approved by: Jon T. Butler 
Thesis Co-Advisor 


Pantelimon Stanica 
Thesis Co-Advisor 


Clark Robertson 
Chairman, Department of Electrical & Computer Engineering 




ABSTRACT 


For the first time, the circular pipeline as a means to significantly improve the throughput 
achieved in the search for bent functions is presented in this thesis. Linear cryptanalysis 
attack is a threat to modern symmetric encryption systems. A good defense is the use of 
a primitive based on Boolean functions having the highest nonlinearity possible—a bent 
function. Bent functions are extremely rare and, therefore, difficult to find. The 
implementation of a sieve on a field programmable gate array (FPGA) provides a high 
throughput (one function per clock) approach to searching for bent functions. With a 
clock frequency of 100 MHz, throughput is 100,000,000 functions per second. The 
circular pipeline as a way to achieve an even higher throughput is examined in this thesis. 
The theoretical maximum speedup is $2^n$, where n is the number of variables. The exact 
achievable speedup has been unknown until now. It is shown that a speedup of 55 is 
achieved at n = 6 with the design proposed in this thesis, which is 86% of the theoretical 
maximum. 


EXECUTIVE SUMMARY 


Computer hardware architecture that speeds up the process of sieving through a pool of 
functions in search of a set of characteristics is presented in this thesis. This 
architecture—the circular pipeline—is motivated by the search for the most nonlinear 
functions, known as bent functions, due to their usefulness in cryptographic applications. 
Bent functions provide for a defense against linear cryptanalysis attack. A linear attack 
attempts to break the cipher key using a series of linear approximations for the key. If 
successful, linear characteristics of the cipher key are exploited and the encryption is 
broken. Bent functions are the least linear of all functions, making them most resistant to 
linear cryptanalysis attack. 


No analytic method is known to solve for bent functions, so large pools of 
candidate functions must be tested in order to find bent functions. Bent functions are 
well defined and testing is straightforward. However, the pools of candidate functions 
are so large that modern processing power is insufficient to exhaustively sieve through all 
possibilities. Utilizing the parallelism afforded by reconfigurable computing on the SRC-6, 
we achieved a speedup of over 60,000 times over a conventional processor at the 
Naval Postgraduate School. The speedup achieved through parallel processing is 
improved through more efficient use of the parallel stages in the circular pipeline design. 


The conventional parallel design tests a single function per clock period. To 
determine whether a function is bent, it must be tested against all linear functions; therefore, the 
conventional design contains tests for all linear functions in parallel. Each test consists of 
calculating the nonlinearity of the function under test and determining if it is a bent 
weight. A bent weight is easily defined, and this part of the test is completed with two 
comparators, one for each of the two bent weights. The nonlinearity is calculated with a 
bitwise exclusive-OR followed by a tree of adders that sums the resulting number of ones. 


The circular pipeline uses the same test modules used in the conventional design, 
but controls the flow of functions through the stages differently. Rather than applying a 
single function to all stages simultaneously for testing, a distinct function is applied to 
each test module, which is a stage of the circular pipeline. If a bent weight is found, the 
function is advanced to the following stage, where another test is applied. If a bent 
weight is not found, the function is discarded and the following stage accepts a new 
function from the function generator. A function is continually passed to a subsequent 
stage as long as it passes tests. If a function passes all tests, it is bent. As soon as a 
function fails a single test, it is ejected, making room for a new function to be inserted into 
the pipeline and tested. The result is more efficient use of the stages compared to the 
conventional design that performs simultaneous tests. 


Exactly what speedup is achievable is related directly to how much more 
efficiently the stages are utilized. This efficiency, in turn, is directly related to how many 
stages functions tend to pass before failing (and being ejected from the pipeline). Due to 
the rarity of bent functions, a function selected at random is more likely to fail an 
individual stage test than to pass. Therefore, a great deal of efficiency, realized as 
throughput and ultimately speedup in total computation time, is gained with the circular 
pipeline architecture. 


The circular pipeline requires additional logic to control the more complex 
flow of information through the stages. Conventional parallelism doubles throughput 
only at the cost of doubling logic resources. Therefore, the circular pipeline must 
achieve a better ratio of speedup to added logic to be a technological improvement. 


Two primary design variations were developed and tested. The first uses a 
reservoir queuing system to equitably distribute functions from a single function 
generator to all stages. This design resulted in the greatest speedup, but its logic resource 
consumption was too great to be practical, and it could be realized only for very simple 
cases. The second design implements independent function generators, one for each 
stage, in order to eliminate the reservoir and provide an economical speedup. A 
contribution of this thesis is to demonstrate a speedup to logic-resources-demand ratio of 
55:2.3. Conventional parallelism yields a ratio of 1:1. Furthermore, this ratio improves 
as the complexity (the number of variables) of the circular pipeline increases. 


I. INTRODUCTION 


A. LINEAR CRYPTANALYSIS 


Matsui [1] introduced the linear cryptanalysis method that succeeded in breaking 
the Data Encryption Standard (DES) block cipher. DES was endorsed by the U.S. 
National Bureau of Standards in 1976 and was ubiquitous in data encryption applications 
into the 2000s. Matsui’s linear cryptanalysis method uses a series of linear 
approximations to decipher the target message. The use of a highly nonlinear Boolean 
function in the encryption process is an effective defense against such a linear 
cryptanalysis attack. Bent functions are highly nonlinear, and therefore useful in securely 
encrypting data. 
B. ENUMERATION OF BENT BOOLEAN FUNCTIONS 


While the precise definition of a bent function is straightforward, generating a 
bent function is not. Currently, our approach to enumerating all n-variable bent functions 
is to exhaustively test a large pool of candidate n-variable functions using a sieve 
technique. It has been demonstrated that a reconfigurable computer is an efficient way to 
test functions for bentness [2]. Until now, the architecture implemented on the SRC-6 at 
the Naval Postgraduate School tests a single function in truth table form simultaneously 
against all affine functions (or a subset thereof determined to be adequate). The 
parallelism afforded by the reconfigurable computer to perform simultaneous tests 
provides a speedup factor of greater than 60,000 over a conventional processor [2]. 


C. SPEEDUP USING A CIRCULAR PIPELINE 


An inherent inefficiency with the current architecture is that a majority of the 
simultaneously performed tests reconfirm the same conclusion: that the function under 
test (FUT) is not bent. This is a result of the rarity of bent functions. Each of the 
parallel tests is performed with a distance calculator that finds the distance between an 
affine function and the FUT. All tests must be applied and passed to declare that a 
function is bent. Conversely, only one test needs to fail to determine that a function is not bent. 
In the majority of cases, a function fails many tests. We seek a method in which a 
function is subjected to individual tests sequentially and is immediately ejected when it fails 
one test. In this way, the test units are used more efficiently and the throughput is 
greater. FUTs that pass are forwarded to subsequent distance calculator stages until they 
either fail their first test or pass all tests. Every test conducted then yields essential 
information, and no resources are wasted performing unnecessary tests [4]. 


With the circular pipeline architecture, the maximum possible throughput is S 
functions per clock, where S is the number of stages. This is achieved when all functions 
fail their first test. The average will be less. By comparison, the conventional sieve 
architecture has a fixed throughput of 1 function per clock [4]. 


Although the number of distance calculators (each belonging to a stage in the 
circular pipeline) remains constant, an increase in the pipeline’s control unit logic is 
expected to be required for a circular architecture. This is due to the increase in possible 
routes for data to flow into and out of each pipeline stage. Each stage of the conventional 
architecture always accepts a new function from the function generator and always passes 
its result along. A circular pipeline stage may or may not accept a new function from the 
function generator, may or may not accept a function from the preceding stage, and may 
or may not pass a function it tests to the subsequent stage for further testing. 


Discovering the exact tradeoff between speedup and additional logic resource 
requirements of the circular pipeline architecture is a key area of interest. 
D. THESIS GOALS 


This thesis investigates the amount of speedup realizable with the circular pipeline 
architecture implemented on the SRC-6. Insight into this will guide further advances in 
bent function discovery using the sieve technique and may also provide useful 
data for high-speed calculation of other mathematical operations amenable to a circular 
pipeline architecture. 


E. THESIS ORGANIZATION 


A basic overview of this thesis is presented in Chapter I. Background information 
is presented in Chapter II. The design proposed by this thesis to attain calculation 
speedup is detailed in Chapter III. Implementation issues are addressed in Chapter IV. 
Results and analysis are presented in Chapter V. The thesis summary and suggestions for 
future research in this area, specifically potential improvements to the proposed circular 
pipeline architecture, are presented in Chapter VI. 


II. BENT FUNCTION DISCOVERY USING SIEVE 


A. FUNCTIONS 
1. Definitions 


a. Boolean Functions 


A Boolean function f on n variables is a map from the n-dimensional 
vector space $V_n = \mathbb{F}_2^n$ to $\mathbb{F}_2$, the two-element field. For a function f, let 
$f_0 = f(0,0,\ldots,0)$, $f_1 = f(0,0,\ldots,1)$, ..., and $f_{2^n-1} = f(1,1,\ldots,1)$. 
$TT = (f_0 f_1 \ldots f_{2^n-1})$ is the truth table representation of f [2]. 
b. Linear Functions 


A linear function is the constant zero function or the exclusive-OR (XOR) 
of one or more variables [2]. There are $2^n$ linear functions. 
c. Affine Functions 


An affine function is a linear function or the complement of a linear 
function [2]. There are $2^{n+1}$ affine functions. 
d. Nonlinearity (NL) 


The nonlinearity $NL_f$ of a function f is the minimum Hamming distance 
between f and an affine function, where the Hamming distance between two functions is 
the number of places where their truth table representations differ [2]. 


e. Bent Weight 


A bent weight is defined to be a nonlinearity of $2^{n-1} \pm 2^{n/2-1}$ [1]. If a function 
is found to have a bent weight for a linear function, it will also have a bent weight 
for that linear function’s complement. Therefore, it is sufficient to test only against all 
linear functions [2]. 
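The complement symmetry stated above can be checked in one line. The notation here is assumed for illustration and is not taken from the thesis: $d(f,l)$ is the Hamming distance between f and a linear function l, and $\bar{l}$ is the complement of l. Complementing l flips every position of its truth table, so every agreement with f becomes a disagreement and vice versa:

```latex
d(f,\bar{l}) = 2^n - d(f,l),
\qquad
2^n - \left(2^{n-1} \pm 2^{n/2-1}\right) = 2^{n-1} \mp 2^{n/2-1}.
```

A distance at one bent weight to l therefore always gives a distance at the other bent weight to $\bar{l}$.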
f. Bent Functions 


A bent function has the maximum nonlinearity among n-variable functions, 
where n is even. A bent function will have bent weights for all $2^n$ linear functions (and 
implicitly, all $2^{n+1}$ affine functions) [2]. 
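The definitions above translate almost directly into code. The following is a minimal software sketch, not the thesis's hardware design. It assumes truth tables are stored as tuples indexed by the integer whose binary digits are the input variables, and the helper names (`linear_truth_tables`, `nonlinearity`, `is_bent`) are hypothetical.

```python
def linear_truth_tables(n):
    """Return the 2**n linear functions as truth-table tuples of length 2**n."""
    size = 1 << n
    # The linear function with coefficient vector a maps x to parity(a AND x).
    return [tuple(bin(a & x).count("1") & 1 for x in range(size))
            for a in range(size)]


def hamming_distance(f, g):
    """Number of positions where two truth tables differ."""
    return sum(fi != gi for fi, gi in zip(f, g))


def nonlinearity(f, n):
    """Minimum Hamming distance from f to any of the 2**(n+1) affine functions."""
    size = 1 << n
    best = size
    for lin in linear_truth_tables(n):
        d = hamming_distance(f, lin)
        # size - d is the distance to the complement of this linear function.
        best = min(best, d, size - d)
    return best


def is_bent(f, n):
    """True if f has a bent weight against every linear function (n even, n >= 2)."""
    half, delta = 1 << (n - 1), 1 << (n // 2 - 1)
    bent_weights = {half - delta, half + delta}
    return all(hamming_distance(f, lin) in bent_weights
               for lin in linear_truth_tables(n))


# Example: f = x1*x2 XOR x3*x4, a standard 4-variable bent function.
f = tuple(((x >> 3) & (x >> 2) & 1) ^ ((x >> 1) & x & 1) for x in range(16))
print(is_bent(f, 4), nonlinearity(f, 4))
```

Because of the complement symmetry, `is_bent` only loops over the linear functions, while `nonlinearity` covers all affine functions by also considering `size - d`.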


It follows that a small portion of the $2^{2^n}$ n-variable functions are 
bent. For n = 4, $896/2^{16} \approx 1.4\%$ of the 4-variable functions are bent. This percentage 
decreases as n increases. For example, n = 6 has a bent function ratio of 
$5{,}425{,}430{,}528/2^{64} = 2.94 \times 10^{-8}\%$. 
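For small n the rarity of bent functions can be confirmed by brute force. The sketch below is a software stand-in for the sieve, assuming truth tables are packed into Python integers (bit x holds f(x)); `count_bent` is a hypothetical name. It is practical only for n = 2 and n = 4, since n = 6 would require visiting $2^{64}$ functions.

```python
def count_bent(n):
    """Count bent functions among all 2**(2**n) n-variable functions (n even)."""
    size = 1 << n
    # Linear functions packed as integers: bit x holds parity(a AND x).
    linear = [sum((bin(a & x).count("1") & 1) << x for x in range(size))
              for a in range(size)]
    half, delta = size // 2, 1 << (n // 2 - 1)
    bent_weights = (half - delta, half + delta)
    bent = 0
    for f in range(1 << size):
        # all() stops at the first failed test, mirroring early ejection.
        if all(bin(f ^ l).count("1") in bent_weights for l in linear):
            bent += 1
    return bent


for n in (2, 4):
    total = 1 << (1 << n)
    found = count_bent(n)
    print(n, found, f"{100 * found / total:.2f}%")
```

The early exit in `all()` is the software analogue of ejecting a function at its first failed test.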
g. Throughput (T) 


Throughput T is the rate at which functions are processed, given in units of 
functions per clock. 
B. PARALLEL SIEVE ARCHITECTURE 


An approach to discover all bent functions for n-variable functions is to 
enumerate all possible truth tables sequentially and apply each to all affine functions 
simultaneously. As depicted in Figure 1, the FUT is bitwise XOR’d with each affine 
function, then ‘Ones Count’ logic determines the number of resulting ones (the Hamming 
distance), followed by a ‘Minimum’ circuit that finds the lowest value across all the ‘Ones 
Count’ results. The output of ‘Minimum’ is the nonlinearity of the function. Together, 
these modules form the distance calculators, each providing the distance between two inputs: an 
affine function and a FUT. This process is pipelined to achieve a clock rate of 100 MHz 
with a throughput of one function per clock on the SRC-6. Each module of the distance 
calculator will now be discussed in further detail. 








Figure 1. Sieve Architecture for Bent Function Discovery. From [5] 


1. XOR Operation 


The bitwise XOR operation of bus width $2^n$ is constructed of $2^n$ parallel 2-input 
XOR gates. This is depicted in Figure 2. 


Figure 2. Bitwise XOR Architecture. From [5] 


2. Ones Count 


The Ones Count circuit is constructed as a tree beginning with $2^n/4$ 4-input adders 
and ending with a 2n-wide adder with an (n+1)-wide output that is the Hamming distance 
to the affine function. This design is illustrated by Figure 3. 


[Figure 3 annotations: OnesCount2 produces a binary number that is the number of 1's 
across 4 inputs (truth tables of 2-variable functions); 4 inputs are chosen to match the 
LUT of the Virtex-II FPGA. The output is a binary number that is the distance to the 
affine function.] 


Figure 3. Ones Count Architecture. From [5] 


3. Minimum 


The minimum circuitry is also constructed as a tree, with each building block 
receiving two (n+1)-wide inputs (the results from the Ones Count modules) and producing 
the (n+1)-wide nonlinearity in binary. This architecture is depicted in Figure 4. 


[Figure 4 annotations: Min produces the minimum of two values. The inputs come from the 
OnesCount modules of the affine functions. The output NL is the nonlinearity of the 
tested function.] 


Figure 4. Minimum Module’s Architecture. From [5] 
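The three modules described above (bitwise XOR, Ones Count tree, Minimum tree) can be modeled in software to make the data flow concrete. This is a behavioral sketch only, assuming truth tables are lists of bits. The function names are hypothetical and nothing here reflects the actual SRC-6 macros or pipelining.

```python
def affine_tables(n):
    """All 2**(n+1) affine functions: the linear functions and their complements."""
    size = 1 << n
    linear = [[bin(a & x).count("1") & 1 for x in range(size)]
              for a in range(size)]
    return linear + [[1 - b for b in lin] for lin in linear]


def ones_count(bits):
    """Adder tree: 4-input leaf counters followed by levels of pairwise adders."""
    level = [sum(bits[i:i + 4]) for i in range(0, len(bits), 4)]
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def minimum_tree(values):
    """Tree of two-input Min blocks reducing all distances to one value."""
    level = list(values)
    while len(level) > 1:
        level = [min(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def sieve_nonlinearity(f_bits, tables):
    """XOR the FUT with every affine function, count ones, then take the minimum."""
    distances = [ones_count([a ^ b for a, b in zip(f_bits, aff)])
                 for aff in tables]
    return minimum_tree(distances)


# Example: f = x1*x2 XOR x3*x4 for n = 4.
f = [((x >> 3) & (x >> 2) & 1) ^ ((x >> 1) & x & 1) for x in range(16)]
print(sieve_nonlinearity(f, affine_tables(4)))
```

In hardware all of the per-affine-function distance computations run concurrently; the list comprehension in `sieve_nonlinearity` performs them one after another, which is exactly the serial cost the parallel sieve avoids.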


C. ADVANTAGES 


The principal advantage of this architecture is that a large number of operations 
are performed in parallel that would otherwise have to be executed serially on a 
conventional CPU. For example, a bitwise XOR operation is required for each affine 
function, which amounts to a total of $2^{n+1}$ operations, or more if the conventional 
processor cannot accommodate a $2^n$-wide bitwise XOR. The ability to execute all of 
these operations in parallel amounts to a significant time savings over conventional 
processors for large n-variable functions [5]. 
D. DISADVANTAGES 


The principal disadvantage of this parallel sieve technique is that, for any one 
cycle, the distance calculators provide redundant information about each non-bent FUT, 
which typically fails many of the parallel tests. 


III. CIRCULAR PIPELINE SIEVE ARCHITECTURE 


An improvement in computational time to discover all bent functions for a given 
n is sought by achieving greater utilization of the distance calculators. The sieve consists 
of $2^n$ stages, each of which computes the distance between f and one of the $2^n$ linear functions. 
Then, it determines if this distance is a bent weight, $2^{n-1} \pm 2^{n/2-1}$. 


Persistence ($P_i$) 


Persistence is the number of stages a function $f_i$ is subjected to before removal 
from the circular pipeline. $P_i$ is equal to the number of passed tests for bentness (one per 
stage) plus one (for the stage that removes $f_i$). P is the average persistence over all 
functions. 


If a function $f_i$ is found to have a bent weight, its persistence $P_i$ is incremented and 
it is passed to the next stage. If $f_i$ is found not to have a bent weight, it is ejected from the 
circular pipeline and the following stage accepts a new function. In the case that $f_i$ is 
bent, $P_i$ will grow to $2^n$. Then, $f_i$ is removed from the pipeline and stored [4]. 


The speedup of the circular pipeline depends on the throughput T, which satisfies 
$1 \le T \le 2^n$. The lower bound occurs if all functions in the pipeline are bent, while the 
upper bound occurs when none of the functions in the pipeline has a bent weight and all are 
therefore ejected after one cycle [4]. 
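The behavior described above can be explored with a small clock-by-clock simulation. This is an idealized sketch, not the thesis's design: it assumes a hypothetical function generator that can refill every empty stage on each clock (so no reservoir is modeled), packs truth tables into integers, and counts passed tests rather than the persistence $P_i$ as defined above.

```python
def simulate_circular_pipeline(n):
    """Push all 2**(2**n) truth tables through a 2**n-stage circular pipeline.

    Returns (number of bent functions found, average throughput per clock).
    """
    size = 1 << n                                   # number of stages and table width
    # Stage s tests against the linear function with coefficient vector s.
    linear = [sum((bin(a & x).count("1") & 1) << x for x in range(size))
              for a in range(size)]
    half, delta = size // 2, 1 << (n // 2 - 1)
    bent_weights = (half - delta, half + delta)

    generator = iter(range(1 << size))              # idealized function generator
    stages = [None] * size                          # (truth_table, tests_passed) or None
    clocks = retired = 0
    bent = []
    while True:
        # Every empty stage accepts a new function while any remain.
        for s in range(size):
            if stages[s] is None:
                f = next(generator, None)
                if f is not None:
                    stages[s] = (f, 0)
        if all(slot is None for slot in stages):
            break
        clocks += 1
        advanced = [None] * size
        for s, slot in enumerate(stages):
            if slot is None:
                continue
            f, passed = slot
            if bin(f ^ linear[s]).count("1") in bent_weights:
                passed += 1
                if passed == size:                  # passed every stage: bent
                    bent.append(f)
                    retired += 1
                else:                               # advance to the next stage
                    advanced[(s + 1) % size] = (f, passed)
            else:
                retired += 1                        # ejected on its first failure
        stages = advanced
    return len(bent), retired / clocks


bent_found, throughput = simulate_circular_pipeline(4)
print(bent_found, round(throughput, 2))
```

Because a function that enters at stage s visits stages s, s+1, and so on around the ring, it meets every linear function exactly once before being declared bent. The reported throughput includes the drain phase at the end of the run, so it slightly understates the steady-state value.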
A. RESERVOIR 


For each cycle, $2^n$ functions must be made available to the circular pipeline in 
case all previously tested functions were ejected. The sieve procedure begins with a 
single function generator, very similar to that used in the conventional design, providing 
these sets of functions. However, not all of these $2^n$ functions will be accepted by the 
circular pipeline because some functions in the circular pipeline will persist, blocking a 
new input. To achieve exhaustive testing, a reservoir for these unaccepted functions must 
be provided so they may be inserted into the pipeline at a later time. Further, a 
mechanism to provide the functions stored in the reservoir to the circular pipeline, instead of a 
new set from the function generator, must be incorporated. 


The reservoir is shown in Figure 5. Functions enter through a multiplexor (MUX) 
that is sourced with two complete sets of $2^n$ functions, one from the function generator 
and the other from the reservoir. If a stage in the circular pipeline is available, a function 
$f_i$ provided by the MUX is inserted. If not, $f_i$ is routed to the lowest available of the 
$2^{n+1} - 1$ registers, beginning with $L_0$. 


Figure 5 is an illustration of the reservoir for n = 2. The circular shape at the top 
of Figure 5 is the circular pipeline with the 4 stages for n = 2. $L_0$ through $Q_2$ are the 
$2^{n+1} - 1$ registers required to ensure registers are available for rejected functions in the 
worst-case scenario. The blocks labeled I are the $2^n$ functions applied by the MUX. 


Figure 5. Reservoir Architecture. 


The purpose of the reservoir is to store functions rejected by the circular pipeline, 
so they can be reinserted later. These temporarily stored functions must be queued such 
that they can be presented to the circular pipeline as a complete set of $2^n$ functions. A 
major problem associated with queuing the functions to form a complete set is assuring 
that no empty registers exist between occupied registers. 


The top registers (Q) are replicated for the purpose of illustration. It must 
be known how many empty registers reside below each incoming function $I_i$ (provided by 
the MUX). Summing the number of occupied L registers with an adder chain is required 
when the L registers are not all filled. The addition operation needed to sum all occupied 
L registers is special in that if a register is found to be occupied, all registers below it are 
occupied as well. Therefore, a thermometer-type adder, or thermo adder, is used to 
provide this sum. 
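The property that makes the thermo adder simpler than a general adder can be shown in a few lines. This is a behavioral sketch under the assumption that the occupied bits are listed from the bottom register upward; `thermo_count` and `is_thermometer` are hypothetical names, not modules from the thesis.

```python
def thermo_count(occupied):
    """Number of occupied registers, given bits ordered from the bottom register up.

    With a thermometer code the count is just the position of the first
    empty register, so no general-purpose adder is needed.
    """
    for index, bit in enumerate(occupied):
        if not bit:
            return index
    return len(occupied)


def is_thermometer(occupied):
    """True when no empty register sits below an occupied one."""
    return all(not (hi and not lo) for lo, hi in zip(occupied, occupied[1:]))


state = [1, 1, 1, 0]
print(is_thermometer(state), thermo_count(state), sum(state))
```

For any valid thermometer code the result equals the ordinary population count, which is why the reservoir must guarantee that it fills from the bottom up.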


Analysis of all possible cases revealed that when the L registers are completely 
occupied, the same thermo adder simply needs to be applied to the Q registers. This is 
because the Q registers will slide down to fill the L registers from the bottom up and the 
incoming functions I will fill in atop these. 


The sum produced by the thermo adder is the input to a chain of adders associated 
with the incoming I functions. A $2^n$-bit signal inToPipe, from the circular pipeline, is 
used in the same fashion as the occupied bits are used with the registers. An asserted 
$inToPipe_i$ indicates that the i-th pipeline stage requires $I_i$ on the next clock; hence, $I_i$ will 
not be stored in the reservoir. If $inToPipe_i$ is low, $I_i$ will be routed into the reservoir. The 
adder chain accounts for the presence of $I_i$ in the reservoir, which is needed to determine 
proper routing of the other incoming I functions above $I_i$. 


The lowest-indexed I function rejected by the circular pipeline is routed to the 
lowest-indexed available register. The next lowest-indexed I function rejected by the 
circular pipeline is stored in the register directly above it. With this behavior, routing 
each function I correctly requires the number of occupied registers below it, including 
any lower-indexed I functions that are being routed to the reservoir on the same clock. 
The adder chain, applied to the occupied bits of the registers and the inToPipe bits of I, 
provides this number and allows for proper routing. 
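The routing rule amounts to a running sum over the rejected inputs. The sketch below is illustrative only: it assumes reservoir slots are numbered from the bottom (slot 0 is $L_0$, with the Q registers following the L registers), and `route_rejected` is a hypothetical name rather than a module from the thesis.

```python
def route_rejected(occupied_count, in_to_pipe):
    """Map each rejected incoming function I_i to a reservoir slot.

    occupied_count: registers already occupied (the thermo adder output).
    in_to_pipe:     bit i is 1 when pipeline stage i accepts I_i this clock.
    Returns {i: slot}, with slots numbered upward from the bottom register.
    """
    destinations = {}
    next_slot = occupied_count
    for i, accepted in enumerate(in_to_pipe):
        if not accepted:
            # I_i lands above every occupied register and above every
            # lower-indexed I function rejected on this same clock.
            destinations[i] = next_slot
            next_slot += 1
    return destinations


# Two registers already occupied; stages 0 and 3 accept their inputs.
print(route_rejected(2, [1, 0, 0, 1]))
```

In hardware this running sum is the adder chain; computing it sequentially here hides the fact that all destinations must be resolved within a single clock.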


When the top L register, $L_{2^n-1}$, is filled, a select signal is asserted and the MUX 
applies the set of $2^n$ L functions from the reservoir. Functions in Q registers slide down 
to the similarly indexed L register, ensuring the reservoir is filled from the bottom up. 
When the MUX selects functions from the reservoir, the function generator must be 
inhibited, which is controlled by the same line used as input to the OR gate that feeds the 
MUX select. When the function generator has completed generating all functions, a done 
signal is sent to the reservoir. This signal also feeds the OR gate leading to the MUX 
select, which routes any remaining functions in the reservoir to the circular pipeline. 


Despite being auxiliary, the reservoir is the most complex part of the circular 
pipeline. An estimate of the growth rate of reservoir complexity as a function of n is given 
in Table 1, which lists the number of connection paths and individual wires (connections 
multiplied by bus width) required by the reservoir to accompany the circular pipeline for a given n. 
The minimum number of transfer paths occurs for $I_0$, which has $2^n$ 
possible paths. There is no case for which $I_0$ will be routed to any of the Q registers. $I_1$ 
can be routed to any L register or $Q_0$; $I_2$ can be routed to any L register, $Q_0$, or $Q_1$. This 
pattern continues until reaching $I_{2^n-1}$, which could be transferred to any of the registers. 
This gives a maximum number of transfer paths of $2^{n+1} - 1$. 


The total number of transfer paths is given by 
$\sum_{\mathrm{Min}}^{\mathrm{Max}} \mathrm{TransferPaths} + 2^n - 1$. The $2^n - 1$ term 
accounts for the paths for each $Q_i$ register to transfer to its corresponding $L_i$ 
register. The total number of wires required is found by multiplying the total transfer 
paths by the bus width of f, which is $2^n$. Lastly, the growth rate column shows the growth 
factor of the total number of required wires with respect to the previous row. Bearing in 
mind that this table omits odd n, we deduce that the complexity of the reservoir grows as 
approximately $8^n$. The circular pipeline is expected to grow at a rate of approximately 
$2^n$, which is the growth rate of the number of stages. This indicates the reservoir 
complexity will likely be a limiting factor as n increases and motivated an alternate 
approach that allows removal of the reservoir. This is discussed in Section C.2. 


Table 1. Reservoir Complexity. 

 n   Stages   Max Transfer   Min Transfer   Total Transfer   Bus Width   Total Wires   Growth
              Paths          Paths          Paths                                      Rate
 2        4              7              4               25           4           100        -
 4       16             31             16              391          16          6256       63
 6       64            127             64             6175          64        395200       63
 8      256            511            256            98431         256      25198336       64
10     1024           2047           1024          1573375        1024    1611136000       64
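The entries of Table 1 follow from the path-counting argument above and can be regenerated with a short script. This is a sketch for checking the arithmetic; `reservoir_complexity` is a hypothetical helper name.

```python
def reservoir_complexity(n):
    """Return (total transfer paths, total wires) for an n-variable reservoir."""
    stages = 1 << n                   # 2**n stages, also the bus width of f
    min_paths = stages                # paths available to I_0
    max_paths = 2 * stages - 1        # paths available to the highest-indexed I
    # Sum the paths of every incoming I function, then add the 2**n - 1
    # paths that let each Q register slide down to its L register.
    total_paths = sum(range(min_paths, max_paths + 1)) + stages - 1
    return total_paths, total_paths * stages


previous = None
for n in (2, 4, 6, 8, 10):
    paths, wires = reservoir_complexity(n)
    growth = "-" if previous is None else round(wires / previous)
    print(n, paths, wires, growth)
    previous = wires
```

Each step of two in n multiplies the wire count by roughly 64, which is the factor of about 8 per variable cited in the text.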























B. CIRCULAR PIPELINE 


Each stage of the circular pipeline is similar to the parallel nonlinearity computers 
of the conventional sieve architecture. However, additional logic is required to handle 
the additional complexity of data flow. For each stage, a control unit must determine 
whether a function should be advanced to the next stage or ejected, and whether the stage 
should accept a function from the preceding stage or a new incoming function. 


To accomplish this, a 1-bit signal $inToPipe_i$ indicates whether the i-th stage is accepting 
the incoming function $I_i$ from the MUX. If not, $I_i$ is stored in the reservoir. The $2^n$-bit 
inToPipe vector is used by the reservoir queuing unit to properly route functions to 
registers in the reservoir. 


An n-bit persistence token P accompanies each function throughout its progression 
in the circular pipeline. A test must be performed to detect when P reaches $2^n$, at which time 
the FUT is determined to be bent, removed from the pipeline, and stored. 


1. Data Flow and Control Logic Complexity Comparison 


The additional complexity, which translates directly into the logic (LUTs on 
the SRC-6) required to realize the design, is best understood by comparing data flow 
through a traditional linear pipeline to the flow through a circular pipeline. Figure 6 is a 
graphical depiction of the basic flow of information through a linear pipeline. For bent 
function searches, this 4-stage pipeline applies to n = 2, and each stage tests f against 
a distinct linear function for a bent weight. If the function passes through all stages, 
never failing a test, it is declared bent. Each stage has one input and one output and 
completes its calculation in one clock. The architecture to control information flow is 
simple, and throughput is fixed at one function per clock. 


Figure 6. Linear Pipeline Information Flow. 


Figure 7 is a depiction of the flow of information through a circular pipeline. 
Figure 7a is the initial adaptation of the linear architecture and Figure 7b is a modified 
version of 7a with the output of stage four wrapped around to be the input of stage one. 
From this illustration, it is immediately clear that greater complexity is required to control 
the flow of functions through the pipeline. Each stage now has a choice between two 
inputs and two outputs, which requires controlling logic. An increase in throughput T is 
the expected payoff. 


Figure 7. Circular Pipeline Information Flow. 


Figure 7 depicts the design that yields optimal T by enabling every stage to output
a result. For the bent function application, we choose to simplify the design by allowing
only one stage to output functions that are determined to be bent, as illustrated in
Figure 8.







Figure 8. Circular Pipeline Data with One Stage Output.

This simplifies the output interface by disallowing the case that 2^n functions are
found to be bent and sent as output on the same clock. If such a case were allowed, as in
Figure 7, the output bus would have to be 2^(2n) bits wide in order to simultaneously transfer
2^n words of 2^n bits each. The SRC-6 can support at least 16 output streams of 320 bits
each [6]. Therefore, there is no restriction on output stages through at least n = 4.
Nonetheless, the simpler design of a single output stage comes with the associated
benefit of simpler logic. With the simplification illustrated by Figure 8, the output bus
is 2^n bits wide, and the number of instances of logic required to check the value of P is
reduced from 2^n to 1. With this design, every stage has two inputs from which to choose
and only one output (to the following stage), save for the one special stage that has an
additional output for functions determined to be bent. Additional ideas regarding this
issue are presented in Chapter VI, Further Research.


C. FUNCTION GENERATOR 


1. With Reservoir 


The circular pipeline with reservoir architecture requires a function generator that
provides 2^n functions on each clock and can be inhibited. This is an extension of the
simple counter in the conventional architecture, which provided one function and always
incremented on each clock. In the conventional architecture, the simple counter used as
the function generator was produced with C-style statements implemented on the field
programmable gate array (FPGA). This is discussed in greater detail in the sections on
Verilog and SRC-6 implementation.


The function generator is also a simple counter when the circular pipeline is used
with a reservoir. On each clock, the function generator produces 2^n functions, one for
each stage of the pipeline. The most significant n bits of each function f_i are hardwired to
i (in binary). A (2^n - n)-bit counter is concatenated onto the least significant bits. In this
way, 2^n distinct truth tables of functions, each 2^n bits long, are formed by the function
generator on each clock. The counter is inhibited on any clock for which the reservoir's L
registers are completely filled, because in this case the reservoir provides the functions.
The counter holds its value until the next clock for which the L registers are not
completely filled (most likely the very next clock), then resumes incrementing.
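The counter-based generator described above can be illustrated with a short software model. This is a hedged sketch rather than the FPGA implementation: the class name `FunctionGenerator`, the `step` method, and the `inhibit` argument are assumptions introduced here for illustration.

```python
class FunctionGenerator:
    """Software model of the generator: n hardwired index bits + shared counter."""

    def __init__(self, n):
        self.n = n
        self.stages = 1 << n                # 2^n functions per clock
        self.low_bits = self.stages - n     # width of the (2^n - n)-bit counter
        self.counter = 0
        self.done = False

    def step(self, inhibit=False):
        """Return 2^n truth tables for this clock, or None if inhibited/done."""
        if inhibit or self.done:
            return None                     # counter holds its value
        funcs = [(i << self.low_bits) | self.counter for i in range(self.stages)]
        self.counter += 1
        if self.counter == (1 << self.low_bits):
            self.done = True                # all 2^(2^n) functions were produced
        return funcs


if __name__ == "__main__":
    gen = FunctionGenerator(2)
    print(gen.step())              # four 4-bit truth tables
    print(gen.step(inhibit=True))  # reservoir full: nothing produced
    print(gen.step())              # counter resumes from where it stopped
```

Because the index bits differ for every stage and the counter sweeps all remaining bits, each of the 2^(2^n) truth tables is produced exactly once over a full run.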


A done signal accompanies the FPGA-based function generator. After all
possible functions have been cycled through, a done bit signals function generator
completion. This signal also asserts the select bit on the input MUX, causing any
functions in the reservoir to be routed for insertion into the pipeline. Additionally, the
counter done signal initiates the termination counter.


The final countdown is 5 × 2^n - 1 clocks. This number of clocks is the worst case
for how long it could take to flush the reservoir and circular pipeline. It occurs when all
functions in the circular pipeline are bent (i.e., when the function generator signals done
and the reservoir is full with 2^(n+1) - 1 bent functions). If this were to happen, it would
take 2^n clocks before the pipeline would accept any functions from the reservoir. After
these 2^n clocks, one function per clock would be inserted into the pipeline, and each would
persist 2^n clocks. The last function from the reservoir is inserted after 2^(n+2) - 1 clocks
and is determined bent after 2^n clocks, for a total of 5 × 2^n - 1 clocks. When this number
of clocks is reached, following the function generator signaling completion, the exhaustive
test is declared complete and a done signal is asserted.


Using a final countdown rather than testing for and generating signals to indicate
the absence of FUTs in the pipeline is a tradeoff between circuit complexity and speed.
The final countdown requires the test to continue running for the entire duration of the
worst-case scenario, which is unlikely to occur. Additional logic could terminate the test
as soon as all functions are removed from the pipeline, saving many of the 5 × 2^n - 1
clocks. However, this is a very small percentage of the total number of clocks required for
the test. Simplifying the circuit and adding a small number of clocks to the test operation
was the favored choice.
2. Without Reservoir 


Due to the complexity of the reservoir, an alternative design was constructed. In
this design, individual function generators exist for each stage. The single function
generator used in the conventional and circular pipeline with reservoir architectures is
replaced by an array of 2^n independent function generators (IFGs). Both designs
continuously produce 2^n truth tables of functions. Each IFG_i has its n uppermost bits
hardwired to its index (in binary), which ranges from 0 to 2^n - 1. The remaining
lower-order bits of each IFG are an independent simple counter. The counter is inhibited
any time its associated stage receives a function passed from the preceding stage. If the
FUT in the preceding stage fails, no function is passed, a function from IFG_i is inserted
into its corresponding stage S_i, and then IFG_i is incremented.


A disadvantage of this approach is the inefficiency that results when IFGs
complete their cycles and then remain idle until the last IFG completes. Any S_i is
underutilized from the time IFG_i completes until the last IFG completes. This is because
there is no function available for insertion when S_i is open; S_i continues only to test
functions passed from the preceding stage.


The circular pipeline with reservoir does not have this inefficiency because 
functions are redistributed equitably to all stages until no functions remain. It was 
postulated that the delta between IFGs’ completion times would not be significant, 
especially as n increases. Due to the nature of bent functions, all stages are expected to 


have an equal probability of passing or rejecting a function selected at random. 


In this configuration, each IFG signals completion, and its input to the stage is
invalidated. All 2^n function generators' done signals are ANDed with the 2^n inToPipe
signals, one from each stage. Each asserted inToPipe_i signal indicates that the FUT in
stage i was found not to have a bent weight. The output of this 2^(n+1)-input AND function
is thereby asserted when all function generators have completed and there is no function
remaining in the circular pipeline with a bent weight. This signals completion of the
exhaustive test.
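The behavior of the circular pipeline with independent function generators can be explored with a small clock-by-clock software model. This is a simplified sketch under stated assumptions, not the implemented Verilog: it removes a bent function at whichever stage its persistence reaches 2^n (the thesis hardware extracts at a single stage), it ignores SRC-6 overhead, and the name `simulate_circular_pipeline` is hypothetical.

```python
def popcount(v):
    return bin(v).count("1")


def simulate_circular_pipeline(n):
    """Return (clocks, bent_functions) for an exhaustive sieve with 2^n IFGs."""
    size = 1 << n                       # number of stages and linear functions
    low_bits = size - n                 # width of each IFG counter
    limit = 1 << low_bits               # functions produced by each IFG
    offset = 1 << (n // 2 - 1)
    bent_weights = {size // 2 - offset, size // 2 + offset}
    linear = [sum((popcount(a & x) & 1) << x for x in range(size))
              for a in range(size)]

    counters = [0] * size               # one independent counter per stage
    incoming = [None] * size            # (function, persistence) passed along
    bent, clocks = [], 0

    while True:
        # Load each stage: a passed function has priority over the stage's IFG.
        stages = []
        for i in range(size):
            if incoming[i] is not None:
                stages.append(incoming[i])          # IFG_i is inhibited
            elif counters[i] < limit:
                stages.append(((i << low_bits) | counters[i], 0))
                counters[i] += 1
            else:
                stages.append(None)                 # IFG_i done, stage idle
        if all(s is None for s in stages):
            break                                   # all IFGs done, pipe empty
        clocks += 1
        incoming = [None] * size
        for i, slot in enumerate(stages):
            if slot is None:
                continue
            f, p = slot
            if popcount(f ^ linear[i]) in bent_weights:
                p += 1
                if p == size:
                    bent.append(f)                  # passed every linear test
                else:
                    incoming[(i + 1) % size] = (f, p)
            # A failed test ejects f; the next stage draws from its own IFG.
    return clocks, bent


if __name__ == "__main__":
    clocks, bent = simulate_circular_pipeline(4)
    print(len(bent), clocks, 2 ** 16 / clocks)
```

The clock count from such a model will not match the runtime figures reported later, since it omits the single extraction stage and all process-control overhead; it is intended only to show how stages alternate between passed functions and their own generators.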
D. PERSISTENCE 


Throughput is directly related to the average persistence, with an upper bound of
2^n if all functions were to persist for only one clock period, and a lower bound of 1 if all
functions persist the 2^n cycles required to determine a function is bent (theoretically,
throughput could be a small fraction less than 1, which is explained below).

A function persists in the circular pipeline as long as the bitwise XOR with each
linear function returns a bent weight of 2^(n-1) ± 2^(n/2-1). The exact persistence of each
function will depend on where in the circular pipeline it is inserted and the order in
which the linear functions are placed amongst the stages. Having no insight into
advantages with any particular ordering of linear functions within stages, we give no
attention to this issue. We expect that the average persistence will depend on the
percentage of bent weights contained within all possible functions. A development of
this fraction of bent weights is provided in [4]:

For each value of n, there are 2^(2^n) n-variable functions, each of which has
a distance value to 2^n linear functions for a total of 2^(2^n + n) instances of a
weight. There are 2^n linear functions, each of which is a distance
2^(n-1) ± 2^(n/2-1) from $\binom{2^n}{2^{n-1} \pm 2^{n/2-1}}$ other functions, for a
total of $2^n \binom{2^n}{2^{n-1} \pm 2^{n/2-1}}$ instances of a weight of
2^(n-1) ± 2^(n/2-1). Thus, the fraction of instances of weight that are
2^(n-1) ± 2^(n/2-1) is

\[
A_n = \frac{2^n \left[ \binom{2^n}{2^{n-1} - 2^{n/2-1}} + \binom{2^n}{2^{n-1} + 2^{n/2-1}} \right]}{2^{2^n + n}}
    = \frac{2 \binom{2^n}{2^{n-1} - 2^{n/2-1}}}{2^{2^n}}
\tag{1}
\]

The results of the algorithm for even n, 2 ≤ n ≤ 8, are included in Table 2. B_n and
N_n are the expected number of bent and non-bent weights for the given A_n. The sum of B_n
and N_n is 2^n. In practice, we cannot have fractional values. So, for this development of
an estimation of throughput and average persistence, we round B_n and N_n to the nearest
integer, notated as [B_n] and [N_n].
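The fraction of bent weights and the expected counts B_n and N_n can be checked numerically. The sketch below assumes the fraction is the count of functions at either of the two bent distances from a linear function divided by 2^(2^n); the function name `bent_weight_fraction` is introduced here for illustration only.

```python
from math import comb


def bent_weight_fraction(n):
    """Fraction of (function, linear function) pairs whose distance is bent."""
    size = 1 << n                       # truth-table length, 2^n
    offset = 1 << (n // 2 - 1)          # 2^(n/2 - 1), n assumed even
    hits = comb(size, size // 2 - offset) + comb(size, size // 2 + offset)
    return hits / (1 << size)           # divide by 2^(2^n) functions


if __name__ == "__main__":
    for n in (2, 4, 6, 8):
        a = bent_weight_fraction(n)
        expected_bent = a * 2 ** n           # B_n
        expected_non_bent = 2 ** n - expected_bent   # N_n
        print(n, a, expected_bent, expected_non_bent)
```

The printed values can be compared with the A_n, Expected B_n, and Expected N_n columns of Table 2, allowing for rounding in the table.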



























































Table 2. Throughput and Average Persistence. From [4]

| n | A_n | Expected B_n | Expected N_n | 2^n | [B_n] | [N_n] | Calc. P_avg | Calc. T_n | T_n Upper Bound | Exp. P_avg | Exp. T_n |
|---|-------|------|-------|-----|----|-----|------|------|-----|------|----|
| 2 | 0.500 | 2.0 | 2.0 | 4 | 2 | 2 | 1.40 | 2.86 | 4 | 2.50 | 1… |
| 4 | 0.244 | 3.9 | 12.1 | 16 | 4 | 12 | 1.31 | 12.2 | 16 | 1.65 | 9… |
| 6 | 0.121 | 7.8 | 56.2 | 64 | 8 | 56 | 1.14 | 56.1 | 64 | - | - |
| 8 | 0.060 | 15.5 | 240.5 | 256 | 16 | 240 | 1.07 | 240. | 256 | - | - |

To calculate P_avg for n = 4, we proceed as follows. There are five possible
sequences of weights for a function to encounter upon insertion into the circular pipeline.
These are illustrated in Table 3.











Table 3. Example Computation of Throughput for n = 4. From [4]

| Sequence of Weights B and N (x is either B or N, such that there are 4 B's and 12 N's) | Time in Pipeline (clocks) | Number of Combinations |
|---------------------|---|-------------------|
| Nxxx xxxx xxxx xxxx | 1 | $\binom{15}{4}$ |
| BNxx xxxx xxxx xxxx | 2 | $\binom{14}{3}$ |
| BBNx xxxx xxxx xxxx | 3 | $\binom{13}{2}$ |
| BBBN xxxx xxxx xxxx | 4 | $\binom{12}{1}$ |
| BBBB NNNN NNNN NNNN | 5 | $\binom{11}{0}$ |


























In Table 3, an 'x' represents either a bent weight B or a non-bent weight N; the
exact placement of each is unimportant, but the counts must total the [B_n] and [N_n]
values given in Table 2. The first entry of Table 3 means that f is inserted into a stage for
which it does not have a bent weight. It is ejected from the pipeline, and its total time in
the pipeline is one clock. In the circular pipeline architecture, functions are always ejected
immediately upon failing a test for a bent weight. Of the 15 x's following the initial N, four
are bent weights and 11 are non-bent weights, which, together with the initial N, totals
[B_4] = 4 and [N_4] = 12. The number of combinations for four bent weights to occur
amongst 11 non-bent weights is given by $\binom{15}{4}$, as shown in the Number of
Combinations column of Table 3.

The second entry of Table 3 illustrates the scenario that a bent weight is found in
the first stage and f is advanced to a second stage. In the second stage, a non-bent weight
is found and f is ejected from the pipeline. For this case, f spends 2 clocks in the pipeline
and there are $\binom{14}{3}$ combinations for which this can occur.

The fifth and final row of Table 3 illustrates the scenario for which f tests for four
consecutive bent weights in the first four stages it encounters. Since only four bent
weights reside within any 16 tests, the final 12 stages find non-bent weights. There are
$\binom{11}{0}$ combinations for this case, which is simply one. With this data we can
compute the average number of clocks a function will persist in the pipeline for n = 4 as

\[
P_{avg} = \frac{1\binom{15}{4} + 2\binom{14}{3} + 3\binom{13}{2} + 4\binom{12}{1} + 5\binom{11}{0}}
               {\binom{15}{4} + \binom{14}{3} + \binom{13}{2} + \binom{12}{1} + \binom{11}{0}} = 1.31
\tag{2}
\]

It follows that throughput will be

\[
T_4 = \frac{2^4}{P_{avg}} = \frac{16}{1.31} = 12.2
\tag{3}
\]


Hence, in a 16-stage pipeline used to sieve for 4-variable bent weights,
approximately 12.2 functions can be processed each clock. Repeating the process for
larger n, we note from Table 2 that T approaches the upper bound of throughput as n
increases. This is due to bent weights becoming increasingly rare as n increases.
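The enumeration in Table 3 generalizes to any count of bent and non-bent weights. The sketch below is an illustration of that counting argument only; the function name `average_persistence` and its parameters `total` (the 2^n tests) and `bent` (the rounded count [B_n]) are introduced here and are not part of the original development.

```python
from math import comb


def average_persistence(total, bent):
    """Average clocks before the first non-bent weight, per the Table 3 counting.

    total: number of tests in one trip around the pipeline (2^n)
    bent:  rounded number of bent weights among them ([B_n])
    """
    weighted = 0
    count = 0
    for clocks in range(1, bent + 2):
        # clocks-1 leading B's, one N, then the remaining B's placed freely
        ways = comb(total - clocks, bent - (clocks - 1))
        weighted += clocks * ways
        count += ways
    return weighted / count


if __name__ == "__main__":
    p_avg = average_persistence(16, 4)      # n = 4: [B_4] = 4, [N_4] = 12
    print(p_avg, 16 / p_avg)                # average persistence and throughput
```

For the n = 4 case this evaluates the same five terms as Table 3, so the result rounds to the 1.31 clocks and 12.2 functions per clock quoted in the text.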


Butler [4] also ran a MATLAB simulation for n = 2 and n = 4 to find
experimental values for P_avg and T_n. These experimental results give lower T. A goal of
this thesis is to provide actual values of T_n, through n = 4, for the circular pipeline sieve
run on the SRC-6.


It is to be noted that the calculations and experimentally produced values
developed in this section have assumed a bent function is removed from the pipeline
upon reaching a persistence of 2^n, i.e., P_bent = 2^n. However, the architecture
implemented in this thesis is simplified by allowing bent functions to be extracted at only
one stage. Therefore, a bent function can persist longer than 2^n clocks, depending on
where it is inserted into the pipeline relative to the location of the bent-function extraction
stage. The persistence of a bent function f_(i,bent) in this architecture is
2^n ≤ P_(i,bent) ≤ 2^(n+1) - 1. Due to the random nature of the function insertion location
in the pipeline, the average persistence of bent functions is

\[
\bar{P}_{bent} = \frac{2^n + \left(2^{n+1} - 1\right)}{2} = \frac{3 \cdot 2^n - 1}{2}
\tag{4}
\]

The rare nature of bent functions minimizes the impact this additional persistence
will have on the average T, especially as n increases, and it is ignored in the development
of Table 2.
1. Worst-Case Scenarios 


For the circular pipeline applied as the bent function sieve, these worst-case 
scenarios are impossible. However, they are included for completeness, as they should 


be considered in alternative applications of the circular pipeline. 
a. With Reservoir 


The worst-case scenario, which would cause T to fall below 1, occurs
when the pipeline processes only bent functions for the entire duration of the test. For the
first 2^n - 1 clocks, all functions persist in the pipeline. From clocks 2^n to 2^(n+1) - 1, the
initial 2^n functions are removed and stored as bent functions. The average persistence of
this group of 2^n functions is given by Equation (4). Following this initial group, T remains
1 because all remaining functions are inserted into stage one and persist exactly 2^n clocks.
Therefore, if the number of functions inserted into the circular pipeline is 2^(2^n), the
average persistence of this worst-case scenario is

\[
P_{avg} = \frac{2^n \bar{P}_{bent} + \left(2^{2^n} - 2^n\right) 2^n}{2^{2^n}}
\]


b. Without Reservoir 


Without a reservoir, we have an IFG associated with each stage. The
worst-case scenario begins the same as it does with a reservoir, with each stage receiving
a bent function on the first clock. After 2^n clocks, new functions are inserted into stage
one, as in the design with a reservoir, and persist for exactly 2^n clocks, giving a
throughput of 1. However, IFG_1 will complete, at which time IFG_2 will begin
inserting its functions; it was previously blocked from inserting functions because S_1 was
passing a function on every clock. The P of all functions produced by IFG_2 will be the
worst case of 2^(n+1) - 1. This pattern continues around the circular pipeline; IFG_3's
functions persist 2^(n+1) - 2 clocks, IFG_4's functions persist 2^(n+1) - 3 clocks, and so
forth. Therefore, the average persistence of this worst-case scenario is equal to the
average bent persistence given in Equation (4).
E. SUMMARY 


In this chapter, the circular pipeline design concept was outlined, and associated data
flow and conceptual issues were addressed. The next chapter covers implementation of
the circular pipeline concept in hardware.

IV. IMPLEMENTATION


The circular pipeline and all associated components, such as the reservoir, were 
constructed in Verilog hardware description language and run on the SRC-6. The process 


of accomplishing this is the topic of this chapter. 
A. VERILOG IMPLEMENTATION 


The circular pipeline architecture Verilog code is fully scalable to any n by
modification of a single parameter. Behavioral Verilog augmented with a handful of
structural statements is the coding style used. Most of the translation of the design
described in Chapter III into Verilog was straightforward and is not described in further
detail. An overview of the Verilog design's components and highlights of some specific
issues are discussed in this section. The full Verilog code is in the Appendix.
1. Reservoir 


The reservoir is the most complex component in the circular pipeline design,
more complex than even the circular pipeline itself. The three main components of the
reservoir are priority encoders, adders, and registers.
a. Priority Encoders 


2^(n+1) - 2 priority encoders are generated for the reservoir, one for each
register except for the topmost Q register. The priority encoders for the 2^n L registers
each have 2^n inputs, one for each of the I functions applied by the input MUX. The
number of inputs to the priority encoder for each Q_i register tapers off as 2^n - i.

Starting with L_0 and working up, each register's priority encoder produces
the lowest-indexed function I_i that is being rejected from the circular pipeline and not
routed to a lower-indexed register. If there is no function to be routed to a given register,
its priority encoder produces all zeros.

b. Adders


Adders are used to produce the number of vacant registers below each I
function. This number is the routing information needed to place a rejected function I_i
into the proper register, ensuring the reservoir is filled from the bottom up. The
assurance that the reservoir is filled from the bottom up allows use of a thermo adder to
produce the value of vacant registers.

The 2^n - 1 occupied-bits l associated with the L registers are applied to the
thermo adder if the topmost L register, L_(2^n-1), is not occupied. If L_(2^n-1) is occupied,
the occupied-bits q of the Q registers are applied to the thermo adder. This is because,
when L_(2^n-1) is occupied, all of the L registers are transferred out of the reservoir to the
input MUX and, simultaneously, all of the Q registers are transferred index-to-index into
the L registers on the next positive clock edge. The number of occupied registers on the
next positive clock edge is needed for proper routing of the I functions. Therefore, the l
bits are applied to the thermo adder when L_(2^n-1) is not occupied, and the q bits are
applied when L_(2^n-1) is occupied.

The thermo adder's Verilog code begins by inspecting the most significant
occupied bit, q or l, and proceeding down the indices. Upon finding an asserted occupied
bit at index i, it is known that all less significant bits will also be asserted, and a value
of i + 1 is returned.

The output of the thermo adder is fed into a chain of 2^n - 1 adders, one for
each I function above I_0. I_0 receives its sum used for routing directly from the thermo
adder. Each adder increases the input value by 1 if I_(i-1) is being routed to the reservoir
and provides this sum to I_i and the next adder in the chain. The adder chain begins with
the sum provided by the thermo adder and continues the running sum by adding the NOT
of the bit inToPipe_i that corresponds to its function I_i. This running sum indicates the
number of functions that will remain in the reservoir below each I_i on the next positive
clock edge.

c. Registers

The 2^(n+1) - 1 registers required by the reservoir are assigned within an
always@(posedge CLK) statement. This statement instantiates a register and is used only
once within the reservoir code for the purpose of creating the registers. Every register
receives its input through a MUX that selects between the output of its priority encoder
and the register's current value. Each L_i register has Q_i as an additional input to its MUX
for the cases in which the Q registers slide down.
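The thermo adder and adder chain described in the Adders subsection can be modeled in a few lines of software. This is an illustrative sketch, not the Verilog from the Appendix: the names `thermo_count` and `routing_sums` are hypothetical, occupied bits are given as a list indexed from the bottom register upward, and the selection between the l and q bits is assumed to have been made already.

```python
def thermo_count(occupied):
    """Count occupied registers from thermometer-coded occupied bits.

    Scans from the most significant bit down; the first asserted bit at
    index i implies bits 0..i are all asserted, so the count is i + 1.
    """
    for i in range(len(occupied) - 1, -1, -1):
        if occupied[i]:
            return i + 1
    return 0


def routing_sums(occupied, in_to_pipe):
    """Running sum seen by each incoming function I_0 .. I_(2^n - 1).

    in_to_pipe[i] is 1 if stage i accepts I_i, 0 if I_i is rejected and
    must be routed to the reservoir.
    """
    sums = [thermo_count(occupied)]          # I_0 uses the thermo adder output
    for i in range(1, len(in_to_pipe)):
        rejected_below = 0 if in_to_pipe[i - 1] else 1   # NOT inToPipe_(i-1)
        sums.append(sums[-1] + rejected_below)
    return sums


if __name__ == "__main__":
    occupied = [1, 1, 0, 0]        # two registers already hold functions
    in_to_pipe = [1, 0, 0, 1]      # I_1 and I_2 are rejected by their stages
    print(routing_sums(occupied, in_to_pipe))
```

In this example the rejected functions I_1 and I_2 see sums of 2 and 3, so they would be placed in the next two registers above the two already occupied, keeping the reservoir filled from the bottom up.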
2. Circular Pipeline 


The circular pipeline is implemented using several modules that carry out the
operations described in the previous chapter. A function was created to describe the
behavior of a standard stage of the pipeline. This function is called 2^n - 1 times. A
modified version of the standard pipeline stage function that has the additional
functionality of removing FUTs it determines to be bent (based on persistence) is
instantiated once. This gives a total of 2^n stages. The remainder of the module consists
of control signals used to direct the flow of functions through the pipeline.
B. VERILOG DESIGN DEVELOPMENT AND TESTING 


Project development was managed with Xilinx ISE 10.1. Synplify Pro D-2009.12
was used for synthesis and ModelSim SE 6.4 was used for simulation. The general
process was to build a section of code and synthesize it. The synthesis report was then
used to correct any errors or warnings, and synthesis was run again. This process was
iterated until synthesis produced an error- and warning-free circuit that appeared
reasonable in the register transfer level (RTL) view. Figures 9, 10, and 11 are examples of
RTL schematics of a single circular pipeline stage for n = 4.

Figure 9. Synplify Pro RTL View of a Circular Pipeline Stage. n = 4.

Figure 10. Synplify Pro RTL View of the Bent Weight Tester Within a Stage. n = 4.

Figure 11. Synplify Pro RTL View of a One's Counter Within a Bent Weight Tester. n = 4.

Next, a Verilog test bench was built to specifically test the section of code under
development. First, the test bench was run in ModelSim and the behavior of the circuit
under test was modeled. The resulting waveform was then analyzed to ensure proper
behavior, corrections were made, and the process was iterated until the behavioral Verilog
was verified to be correct. Following the successful behavioral Verilog development, we
mapped the Verilog design to the target FPGA, and a post-map simulation model was
returned by Xilinx ISE. This post-map model was then simulated in ModelSim iteratively
until successful functionality was verified. Figure 12 is a small section of a ModelSim
post-map waveform of the circular pipeline returning three bent functions. Post-map
simulation models include logic delay, which is evident in the output being delayed
approximately 6 ns from the positive edge of the clock (in the figure, the clock is slowed
from a runtime period of 10 ns to a period of 16 ns for troubleshooting purposes).

[Waveform signals: CircPipeStreamTB/START, CLK, CLR, DONE, DATA_OUT, VALID_OUT, TERM_OUT]























Figure 12. ModelSim Post-map Simulation Result Excerpt. 


C. SRC-6 IMPLEMENTATION 


With a logic design successfully tested through post-map simulation, the final step 
was implementation on the SRC-6. This involves coordinating the interaction between 
the CPU that controls the process at runtime and the logic design programmed onto the 
FPGA. Four files are required in addition to the Verilog design: main.c, info, blk.v, and 


Makefile. These files are included in the Appendix.

1. Macro Characteristics


The input/output requirements of the Verilog-coded circular pipeline, known as a
macro in SRC-6 literature, must be characterized in order to choose an appropriate
implementation. The circular pipeline requires no input aside from the system clock. It
produces outputs that are held for one clock at unpredictable times throughout macro
execution. This is a marked difference from the conventional macro design, which was
called, returned a value, and terminated on each clock (the function generator was located
outside of the macro). That highly regular behavior allowed for the use of the simplest
macro implementation, pure functional.

Because the macro returns values while continuing its run, rather than
returning a value at run termination, an external macro was also unfit for the circular
pipeline implementation. A stateful macro remained the only possibility among the
known types, but uncertainty remained about its suitability. Finally, on the advice of an
SRC engineer, a streaming external macro was explored and found to fit the circular
pipeline's characteristics [7].

2. Streaming Output


Streaming output allows data to be returned from the circular pipeline and
stored in On Board Memory (OBM) on any clock throughout the duration of the sieving
process. With the implemented circular pipeline returning a maximum of one function
per clock, no bottleneck will occur so long as n < 7. For n = 7, the function width is
greater than 64 bits, and so a bottleneck could occur over the 64-bit bus used to transfer
data from the macro to OBM.

While this was not a concern in the implementations installed for this thesis, due to
other limiting factors preventing n > 7, the stream construct can handle such a case. The
SRC-6 stream construct includes a buffer that can be configured to handle a backlog of
data outflow and stall the circular pipeline until the backlog is processed (e.g.,
transferred out).

3. CPU


Top-level control is maintained by the CPU through the main.c file. The main.c file
allocates memory, calls a subroutine that leads to the macro, and prints results.
4. Subroutine and Macro Call 


The subroutine is an interface between the main.c file and the macro. It is written 
in C-style code, but implemented on the FPGA. The subroutine sets up data types, calls 
the macro in a way that supports streaming, and passes data from OBM to the CPU. In 
addition to the subroutine, the files info and blk.v configure the interface between the 


CPU and the macro. They declare the input/output data types and sizes. 
5. Timing

For n ≤ 5, all timing conditions are met with the circular pipeline as described to
this point. For n = 6, the mapper and place and route application are unable to meet the
timing constraint along the critical path. The SRC-6 uses a fixed clock of 100 MHz,
which means the delay along every path must be less than or equal to 10 ns.

The place and route application was unable to meet the 10 ns timing constraint
along all paths for n = 6. However, the circular pipeline behaved as expected at runtime
for the sample set of functions used. Thus, the critical paths identified by the place and
route application are probably not the true critical paths of the circular pipeline. Rather,
they are theoretical worst-case paths that the place and route application was unable to
eliminate as possibilities.
6. FPGA Resources 


For n < 7, the resources of a single Xilinx Virtex-II XC2V6000 FPGA are
sufficient to realize the circular pipeline. For larger n, moderate changes to the SRC-6
implementation strategy must be adopted. Further details are included in Chapter VI.

Exact resource usage data for n < 7 is included in Chapter V.

D. SUMMARY

In this chapter, the development process for circular pipeline implementation on
the SRC-6 was covered. The next chapter provides results from the implemented
circular pipeline.

V. RESULTS


A. SPEEDUP 


Speedup results of the circular pipeline with IFGs are summarized in Table 4. The
Clocks columns give the total number of clocks that the implemented design required to
complete an exhaustive test. T_n is throughput, Upper Bound is the maximum possible,
and Realized is what was achieved at runtime. This data is from the implemented
architecture running on the SRC-6, so it includes latency and overhead associated with
SRC-6 process control. For small n, this overhead is a large percentage of the clocks
needed for test completion. This is why the speedup for n ≤ 3 does not closely match the
realized T_n. For n > 3, the overhead is a very small percentage of the total number of
clocks required to complete the exhaustive test. Because the conventional design
maintains a T_n of nearly unity for n > 3, the increased T_n of the circular pipeline becomes
the speedup realized, rendering T_n equivalent to the speedup for n > 3.


Due to excessive computational time requirements, on the order of decades, 


complete results for n = 6 are impossible. However a test set of 3.2 x 10'* (1.7 x 10 °% 


of all 2^(2^6) = 2^64 functions required for an exhaustive test) were run and the results are prorated


to give a value for the complete enumeration. Asterisks denote these values. 


T is calculated by dividing the number of functions processed by the number of
clocks.

\[
T_n = \frac{2^{2^n}}{\text{Clocks}}
\tag{5}
\]

For example,

\[
T_4 = \frac{2^{16}}{7{,}840} = 8.36
\]

Speedup is calculated by dividing the conventional design's clocks by the
circular pipeline's clocks.

Table 4. Realized Speedup.

| n | Circular Pipeline T_n: Upper Bound | Circular Pipeline T_n: Realized | Conventional T_n: Upper Bound | Conventional T_n: Realized | Clocks: Conventional | Clocks: Circular | Speedup |
|---|----|-------|---|-------|-------------|-------------|------|
| 2 | 4 | 0.296 | 1 | 0.078 | 205 | 54 | 3.8 |
| 3 | 8 | 2.15 | 1 | 0.573 | 446 | 119 | 3.7 |
| 4 | 16 | 8.36 | 1 | 0.997 | 65,727 | 7,840 | 8.4 |
| 5 | 32 | 21.7 | 1 | 1 | 42.9 × 10^8 | 1.98 × 10^8 | 21.7 |
| 6 | 64 | 55* | 1 | 1 | 184 × 10^17 | 3.33 × 10^17 | 55* |



































*Estimate based on small sample size (number of functions tested << 2^(2^n))
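The realized throughput and speedup entries of Table 4 follow directly from the clock counts. The short sketch below recomputes them for the rows with exact clock counts (n = 2, 3, 4); the dictionary `CLOCKS` and the helper names are illustrative, and the clock values are copied from Table 4.

```python
# (conventional clocks, circular pipeline clocks) from Table 4
CLOCKS = {2: (205, 54), 3: (446, 119), 4: (65727, 7840)}


def throughput(n, clocks):
    """Functions processed per clock for an exhaustive test of 2^(2^n) functions."""
    return 2 ** (2 ** n) / clocks


def speedup(conventional_clocks, circular_clocks):
    """Ratio of conventional clocks to circular pipeline clocks."""
    return conventional_clocks / circular_clocks


if __name__ == "__main__":
    for n, (conv, circ) in CLOCKS.items():
        print(n, throughput(n, circ), throughput(n, conv), speedup(conv, circ))
```

For n = 4 the speedup and the realized circular throughput nearly coincide, which reflects the conventional design's throughput being close to one function per clock.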








From Table 4, it is noted that a 55 times speedup over the conventional sieve
design is achieved by the circular pipeline. More importantly, there is a trend of
increasing speedup as n increases. Figure 13 is a graph of this trend juxtaposed with the
upper bound of 2^n; it is concluded that the speedup achieved by the circular pipeline is on
the order of 2^n.

Figure 13. Realized Throughput.

The throughput plotted in Figure 13 does not simply follow the upper bound at a
reduced fraction, but approaches the upper bound as n increases. This conclusion is best
illustrated in Figure 14, which is normalized to 2^n.

Figure 14. Throughput Normalized to 2^n.


B. RESOURCES 


A comparison of resources consumed between the circular pipeline and
conventional design is provided in Table 5. The three resource categories are given as
percentages of the resources available on the Xilinx Virtex-II FPGA. A slice is the basic
building block of the FPGA. Each of the 44,096 slices contains two D flip-flop registers
and two 4-input Lookup Tables (LUTs), for a total of 88,192 each. From Table 5, we
conclude that LUTs are the limiting factor, as they are consumed at a higher rate than
registers as n increases. Therefore, the column Circular Pipeline Resource Multiple is the
ratio of the 4-input LUTs percentage consumed by the circular design to the percentage
consumed by the conventional design.

For n ≤ 4, the circular pipeline consumes fewer resources than the conventional
design, as shown in Table 5. This is an unexpected and not well understood result. For
n ≤ 7, the resources consumed are less than a multiple of three over the
conventional design. The additional resource consumption of the circular pipeline is
attributed to its control logic.

Table 5. Resources Consumed Summary.





















































n peer Registers | Occupied Slices | 4-input LUTs | Circular Pipeline 
& (%) (%) (%) Resource Multiple 
5 Conventional 4 3 3 , 
Circular 1 1 3 
: Conventional 4 6 3 ; 
Circular 1 2 3 
Conventional 5 7 4 
4 0.75 
Circular 3 5 3 
Conventional 5 9 6 
5 1.17 
Circular 5 10 7 
Conventional 7 17 13 
6 2.31 
Circular 23 25 30 
Conventional 9 42 38 
7 2.47 
Circular 50 113 94 
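
The Circular Pipeline Resource Multiple column can be checked against the LUT percentages in Table 5. The following is a minimal Python sketch, not part of the thesis code; the dictionary and function names are illustrative assumptions, and only the rows with legible multiples (n = 4 to 7) are included.

```python
# 4-input LUT utilisation (%) transcribed from Table 5:
# n -> (conventional design, circular pipeline)
LUT_PERCENT = {4: (4, 3), 5: (6, 7), 6: (13, 30), 7: (38, 94)}


def resource_multiple(conventional_pct, circular_pct):
    """LUT usage of the circular pipeline as a multiple of the conventional design."""
    return circular_pct / conventional_pct


multiples = {
    n: round(resource_multiple(conv, circ), 2)
    for n, (conv, circ) in LUT_PERCENT.items()
}
print(multiples)
```

The rounded ratios agree with the tabulated multiples only when the circular percentage is the numerator, which is how the column should be read.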























C. RESERVOIR TRADEOFF 


The use of a reservoir to queue and equitably distribute generated functions among
the stages provides the fastest computation. However, the large demand on logic
resources and associated delay rendered its implementation unrealizable for n > 3. For
n = 4, the worst-case path delay renders a maximum frequency of less than 30 MHz.
Attempts to pipeline the reservoir for the purpose of decreasing delay such that the 100
MHz fixed clock of the SRC-6 could be used were unsuccessful.


A comparison between the circular pipeline (without reservoir) and the circular
pipeline with reservoir is provided in Table 6. The numbers of clocks given for n ≥ 4 in
Figure 7 are simulation results, not runtime data from the SRC-6 like all other numbers.
Circuits for n ≥ 4 are unrealizable, so simulation results are required to make the speedup
comparison. In practice, if the circular pipeline with queue architecture were to be
implemented, it would require more registers than were reported for the unrealizable
circuit that was synthesized. However, even with double the registers, LUTs would still
be the limiting factor. The number of LUTs is expected to remain constant, so the LUT
comparison for n = 4, which is data taken from the map report, is valid. From Table 6
and the maximum frequency for n = 4 being less than 30 MHz, it is clear that the resource
and timing demands of the reservoir cannot be met for large n and the simpler design is
better suited for the task.

Table 6. Circular Pipeline With and Without Reservoir (Res) Comparison.

n   Clocks (Res)   Clocks (w/o Res)   Speedup   LUTs (Res)   LUTs (w/o Res)   Resource Multiple
2   45             54                 1.20      3            3                1
3   111            119                1.07      3            3                1
4   7,259          7,815              1.08      13           3                4.33
5   -              -                  -         70           7                10
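
The Speedup and Resource Multiple columns of Table 6 follow directly from the clock and LUT columns. The Python sketch below is an illustrative check rather than thesis code; the names are assumptions, and the n = 5 row is omitted because no clock counts are reported for it.

```python
# Rows transcribed from Table 6:
# n -> (clocks with reservoir, clocks without, LUTs with reservoir, LUTs without)
TABLE6 = {
    2: (45, 54, 3, 3),
    3: (111, 119, 3, 3),
    4: (7259, 7815, 13, 3),
}


def reservoir_speedup(clocks_res, clocks_no_res):
    """Speedup gained by adding the reservoir: fewer clocks means a ratio above 1."""
    return clocks_no_res / clocks_res


def reservoir_resource_multiple(luts_res, luts_no_res):
    """LUT cost of the reservoir design relative to the plain circular pipeline."""
    return luts_res / luts_no_res


summary = {
    n: (round(reservoir_speedup(cr, cn), 2),
        round(reservoir_resource_multiple(lr, ln), 2))
    for n, (cr, cn, lr, ln) in TABLE6.items()
}
print(summary)
```

For n = 4 the reservoir buys roughly an 8% reduction in clocks at more than four times the LUT cost, which is the tradeoff the text describes.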









































The speedup produced by the reservoir is limited by the delta between completion
times of the IFGs. From Figure 15, we conclude that a significant portion of the maximum
delta in completion times is due to using only one stage to remove bent functions. An
effect of using just one output stage is that a bent function will persist
$2^n \le P_{bent} \le 2^{n+1} - 1$, depending on the stage into which it is inserted. The
stages are numbered from 1 to 16 in Figure 15, beginning with the stage that results in
optimal $P_{bent}$ and ending with the worst-case stage. As n increases, this effect will be
reduced as bent functions become rarer. Figure 15 is a plot of the additional clocks required
by each $IFG_i$ after the first IFG completed. This value is given as a percentage of the total
clocks required for the complete computation. $IFG_{16}$ terminates 1667 clocks after $IFG_1$,
which is 21.3% of the total clocks consumed.

Figure 15. Relative Completion Times of the IFG.
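
The persistence bounds can be illustrated with a small software model. The Python sketch below is a simplification under stated assumptions, not the thesis implementation: it counts stage-to-stage transfers for a function that passes every test, and it assumes stage 0 is the single output stage, which releases a function only once it has accumulated at least $2^n$ passes. The function name is hypothetical.

```python
def persistence(insert_stage, n):
    """Transfers a bent function makes before leaving through output stage 0.

    Simplified model: the pass counter starts at 0 on insertion, increases by
    one per transfer, and the function exits when it arrives at stage 0 with
    at least N = 2**n passes.
    """
    N = 2 ** n
    stage = insert_stage
    passes = 0
    while True:
        stage = (stage + 1) % N
        passes += 1
        if stage == 0 and passes >= N:
            return passes


n = 4
spread = sorted(persistence(g, n) for g in range(2 ** n))
print(spread[0], spread[-1])

# Completion gap quoted in the text, relative to the 7,815 clocks
# reported in Table 6 for n = 4 without the reservoir.
gap_percent = round(100 * 1667 / 7815, 1)
print(gap_percent)
```

In this model every insertion stage gives a distinct persistence between $2^n$ and $2^{n+1} - 1$, so the stage at which a bent function enters fully determines how long it occupies the pipeline.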


D. SUMMARY 


The circular pipeline results in a speedup on the order of $2^n$ over the conventional
architecture used to exhaustively sieve for n-variable bent functions. This speedup is
achieved with a small fraction of the logic resources required to achieve
a similar speedup with the conventional architecture.


For n = 6, a speedup of 55 times is realized with a resource increase of 2.3 times.
With the conventional design, a similar speedup would require a logic resource increase
of 55 times. This is because the only way to increase speedup with the fixed throughput
of the conventional design is to duplicate the circuit and distribute the functions to be
tested equally among the duplicated circuits. Speedup gained in this way utilizes
parallelism; doubling the instances of the circuit doubles the throughput. This method of
gaining speedup is amenable to the circular pipeline as well. However, for n = 6,
allocating triple the logic resources of a conventional design and replacing it with the
circular pipeline will achieve a speedup of 55 times, vice three times.
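
The comparison above reduces to speedup per unit of logic. The Python sketch below restates the n = 6 figures quoted in the text; the function name is an illustrative assumption, and replication is modeled as ideal (k copies give exactly k times the throughput for k times the logic).

```python
def speedup_per_resource(speedup, resource_multiple):
    """Speedup obtained per unit increase in logic resources."""
    return speedup / resource_multiple


# n = 6 figures quoted in the text.
circular = speedup_per_resource(55, 2.3)   # circular pipeline
replicated = speedup_per_resource(55, 55)  # 55 copies of the conventional design

print(round(circular, 1), replicated)

# With a budget of three times the conventional logic, ideal replication
# yields 3x, while the circular pipeline (2.3x logic) fits and yields 55x.
budget = 3
print(budget, 55 if 2.3 <= budget else None)
```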


In this chapter, the throughput and resource consumption of implemented circular
pipelines were presented and analyzed. The next chapter concludes this thesis with
recommendations for further research.

VI. CONCLUSIONS AND RECOMMENDATIONS


A. CONCLUSION 


The circular pipeline architecture was implemented on the SRC-6 and
demonstrated speedup on the order of $2^n$. This speedup is realized with a logic resource
increase of less than threefold for n < 7. For n = 6, the ratio of speedup to logic resource
increase over the conventional architecture is 55:2.3. Previous speedup gains were limited to
increases in parallelism, which yield a 1:1 ratio of speedup to logic resource
consumption increase. The circular pipeline is an efficient means of increasing
throughput in sieving applications.


The reservoir developed for this thesis provides for the most efficient use of the
circular pipeline by redistributing functions equitably. However, the delta of run time
between the IFGs is minor. Therefore, the cost in complexity of the reservoir is not
worth the speedup gained. Yet, the reservoir could be essential if the circular pipeline is
applied to other applications that lack the characteristic of the bent-function sieve
that provides for an even distribution of passed and rejected functions among the stages.


B. RECOMMENDATIONS FOR FURTHER RESEARCH 


1. Multiple Output Stages


The design presented in this thesis assumed a hard limitation of a single 64-bit
output bus. This motivated the design to restrict output to a single stage. In order
to run the circular pipeline on the SRC-6, techniques new to the Naval Postgraduate
School were implemented. Namely, the use of output streams was critical for the circular
pipeline's behavior. While learning the use of output streams, it was realized that up to
16 1024-bit wide output streams can be used. The streams have a programmable buffer
mechanism to take care of any bottleneck problems over the 64-bit output bus. Using all
16 of these output streams (for n > 4) should be a fairly simple improvement to
implement. This will result in more LUTs required for the additional stages tasked with
examining the persistence token, but will improve throughput.

2. Pipelined Reservoir


As noted in Chapter IV, pipelining attempts with the circular pipeline with 
reservoir design failed. However, it may be possible. If the circular pipeline is to be 
applied to other applications, the reservoir will likely be more important, so pipelining it 


to reduce the worst-case path delay could be important. 
3. Multiple FPGAs 


For n = 7, the circular pipeline design does not fit on a single Virtex-II FPGA.
Multiple FPGAs must be used for these cases. This is a nontrivial SRC-6 implementation 
issue that will also require modification to the Verilog code. Solving this issue will likely 
have the most impact on the continuing bent-function research at the Naval Postgraduate 


School. 
4. Function Generators 


While this thesis focuses on speedup via hardware design, the most important 
speedups moving forward will be gained by reducing the number of functions that require 
testing. This is the current focus of the continuing bent functions research at the Naval 
Postgraduate School. Understanding special characteristics of bent functions and using 
this understanding to eliminate many of the functions included in an exhaustive test is the 
first step. Building a function generator to produce only these functions is the second 
step. For the circular pipeline produced in this thesis, it is important that the $2^n$ IFGs
produce, on average, functions with the same total number of bent weights.
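
To make the role of the IFGs concrete, the Python sketch below models how the function space is partitioned in the appendix listing, where each generator concatenates a fixed high-order index with a free-running counter ({FNCShob[g], counter[g]}). It is a software illustration under that assumption, not the hardware itself; the function name is hypothetical.

```python
def ifg_functions(g, n):
    """Truth tables (as integers) produced by independent function generator g.

    The top n bits of the N = 2**n bit truth table are fixed to g; the
    remaining N - n bits come from a counter that sweeps all values.
    """
    N = 2 ** n
    low_bits = N - n
    for count in range(2 ** low_bits):
        yield (g << low_bits) | count


n = 2
per_ifg = [list(ifg_functions(g, n)) for g in range(2 ** n)]
all_functions = sorted(tt for block in per_ifg for tt in block)
print(per_ifg)
```

Each generator covers an equal, disjoint slice of the $2^{2^n}$ truth tables. A generator that produces only candidate functions would have to preserve a comparable balance of bent functions across slices for the stages to finish at similar times.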


This area of research requires in-depth mathematical understanding of bent
functions as well as ingenuity with Verilog hardware design. In return, it will likely
produce the most significant results.

APPENDIX. PROGRAMMING CODE

A. VERILOG

1. Circular Pipeline With Independent Function Generators
//
// MY_CIRC_Pipe.v - An interface between the circular pipeline code that sets up streaming
//                  with the SRC-6. Based on the SRC example user_one_stream.
//
// Created: August 7, 2010
// Last Modified: September 3, 2010
// Author: Chris Johnson
//
// Notes: DATA_OUT bus width is not parameterized; must be manually edited for n>5.
//        modDATA_OUT must be edited for n>6.
//
// Sub-module calls: CircPipe.v
//
module MY_CIRC_PIPE (
    input START,
    input CLK,
    input CLR,
    output reg DONE,
    output reg [31:0] DATA_OUT,
    output reg VALID_OUT,
    input STALL_IN,
    output reg TERM_OUT
    );

    //parameter names for the states
    localparam IDLE = 0;
    localparam ACTIVE = 1;
    localparam STALLED = 2;
    localparam FINISHING = 3;

    reg [1:0] state;

    //wire connections from module call
    wire modDONE;
    wire [63:0] modDATA_OUT;
    wire modVALID_OUT;
    wire modTERM_OUT;

    always @*
        if (CLR) begin
            DATA_OUT <= 0;
            DONE <= 0;
            VALID_OUT <= 0;
            TERM_OUT <= 0;
            state <= 0;
        end
        else
            case (state)
                IDLE: if (START) begin
                    DATA_OUT <= 0;
                    VALID_OUT <= 1;
                    state <= ACTIVE;
                end
                ACTIVE: begin
                    DATA_OUT <= modDATA_OUT;
                    DONE <= modDONE;
                    VALID_OUT <= modVALID_OUT;
                    TERM_OUT <= modDONE;
                    state <= ACTIVE;
                    if (STALL_IN) begin
                        VALID_OUT <= 0;
                        state <= STALLED;
                    end
                end
                STALLED: if (~STALL_IN) begin
                    VALID_OUT <= 1;
                    state <= ACTIVE;
                end
                FINISHING: begin
                    state <= IDLE;
                end
                default:;
            endcase

    CircPipe u1 (START, CLK, CLR, modDONE, modDATA_OUT, modVALID_OUT, STALL_IN);

endmodule

//
// CircPipe.v - The circular pipeline with independent function generators top level module.
//
// Created: December 22, 2009
// Author: Jon T. Butler
// Last Modified: September 3, 2010
// Modified by: Chris Johnson
//
// Notes: Set parameter 'n' in this file. It is passed to all sub-modules.
//
// Called by: MY_CIRC_PIPE.v
//
// Sub-module calls: countersMod.v
//                   Stage_TT.v
//
// This implements the circular pipeline. For n-variable functions, there are N = 2**n stages,
// one for each linear function (we need only compare against the linear functions, since a
// function that has a bent distance from all linear functions has a bent distance away from
// all affine functions). In this realization, only one stage has a bent function output - to
// simplify the circuit. In this way, the circular pipeline serves as a buffer. In this
// case a bent function will go through from N to 2N-1 stages.
//
module CircPipe #(parameter n=6, parameter N=2**n) //n is number of variables. N is # of bits in func's TT.
    (   input START,
        input CLK,
        input CLR,
        output reg done,            //Asserted when all counters are done & pipe empty.
        output [63:0] BENT,
        output valid_out,           // Indicates a valid bent function is at BENT.
        input STALL_IN
    );
    wire [N-1:0] countDone;             // Set when counter has completed one cycle
    wire [N-1:0] LIN_FNC [N-1:0];
    wire [N-1:0] REJECT;                // 0 bit indicates FNCS word not accepted.
    reg temp;
    wire [N-1:0] FNCS [N-1:0];          // Each of the N words in counter FNCS has N bits.
    wire [n-1:0] FNCShob [N-1:0];       // High order bits for the counter
    wire [N-n-1:0] counter [N-1:0];     // N simple counters, extra bit to signal counter is done
    wire [N-1:0] to_stage;
    wire [N-1:0] stage_TT [N-1:0];
    wire [n+1:0] no_passes [N-1:0];

    genvar g;

    ////CREATE INDEPENDENT FUNCTION GENERATORS (IFG)/////////////////////
    ////Instantiate independent counters for function gens///////////////
    generate
        for (g=0; g<N; g=g+1)
        begin: CountersGen
            countersMod #(.n(n)) u4(START, CLK, CLR, STALL_IN, REJECT[g], counter[g], countDone[g]);
        end
    endgenerate

    generate
        //Generate high order bits
        for (g=0; g<N; g=g+1)
        begin: CounterHOB
            assign FNCShob[g] = g;
        end
    endgenerate
    //Generate counters
    generate
        for (g=0; g<N; g=g+1)
        begin: CounterConcat
            assign FNCS[g] = countDone[g] ? {N{1'b0}} : {FNCShob[g],counter[g]};
        end
    endgenerate
    ////CREATE INDEPENDENT FUNCTION GENERATORS (IFG)/////////////////////

    ////TERMINATION SIGNAL///////////////////////////////////////////////
    always@*
        if (countDone[N-1:0] == {N{1'b1}} && to_stage[N-1:0] == {N{1'b0}})
            done <= 1'b1;
        else
            done <= 1'b0;
    ////TERMINATION SIGNAL///////////////////////////////////////////////

    ////LINEAR FUNCTIONS/////////////////////////////////////////////////
    generate
        for (g=0; g<N; g=g+1)
        begin: LinearGen
            assign LIN_FNC[g] = Linear(g);
        end
    endgenerate

    function [N-1:0] Linear(input [n-1:0] Y);
        integer j;
        integer k;
        reg [n-1:0] X;
        begin
            for (j=0; j<N; j=j+1)
            begin
                X = j;
                temp = 0;
                for (k=0; k<n; k=k+1)
                begin
                    temp = temp ^ (X[k] & Y[k]);
                end
                Linear[N-1-X] = temp;
            end
        end
    endfunction
    ////LINEAR FUNCTIONS/////////////////////////////////////////////////

    ////INSTANTIATE STAGES///////////////////////////////////////////////
    generate
        for (g=0; g<N; g=g+1)
        begin: Stages
            if(g != 0) begin
                stage #(.n(n)) u2(CLK, FNCS[g], REJECT[g], to_stage[g-1], to_stage[g], stage_TT[g-1],
                    LIN_FNC[g], stage_TT[g], no_passes[g-1], no_passes[g], countDone[g]);
            end
            if(g == 0) begin
                stage1 #(.n(n)) u3(CLK, FNCS[g], REJECT[0], to_stage[N-1], to_stage[0], stage_TT[N-1],
                    LIN_FNC[0], stage_TT[0], no_passes[N-1], no_passes[0], countDone[g], BENT, valid_out);
            end
        end
    endgenerate
    ////INSTANTIATE STAGES///////////////////////////////////////////////
endmodule
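
As a software cross-check of the Linear function in CircPipe.v, the Python sketch below builds the same truth tables with the same bit ordering (the value for input X is stored at bit N-1-X). It is an illustrative model rather than part of the thesis code, and the function name is an assumption.

```python
def linear_truth_table(y, n):
    """Truth table of the linear function x -> parity(x AND y), as an integer.

    Mirrors the Verilog ordering: the output for input x sits at bit N-1-x.
    """
    N = 2 ** n
    tt = 0
    for x in range(N):
        bit = bin(x & y).count("1") & 1   # XOR of the ANDed bit pairs
        tt |= bit << (N - 1 - x)
    return tt


n = 2
tables = [linear_truth_table(y, n) for y in range(2 ** n)]
print([format(t, "04b") for t in tables])
```

Every nonzero linear function produced this way has exactly N/2 ones, which is a quick sanity check on any reimplementation.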


























//
// countersMOD.v - Instantiates an inhibitable counter.
//
// Created: August 11, 2010
// Author: Chris Johnson
// Last Modified: September 3, 2010
//
// Notes: This counter is the lower N-n-1 bits of the function gen in CountersMod.v.
//
// Called by: CountersMod.v
//
// Sub-module calls: None
//
module countersMod #(parameter n = 6, parameter N=2**n)
    (   input START,
        input CLK,
        input CLR,
        input STALL_IN,
        input REJECT,
        output reg [N-n-1:0] counter,
        output reg countDone
    );
    reg [1:0] state = 0;

    always@(posedge CLK, posedge CLR)
        if(CLR) begin
            countDone <= 0;
            counter <= 0;
            state <= 0;
        end
        else
            case (state)
                0: if (START) begin
                    counter <= 0;
                    state <= 1;
                    countDone <= 0;
                end
                1: begin                            //counter active
                    if (!REJECT && !STALL_IN)
                        counter <= counter + 1;
                    if (counter == 2**(N-n)-1)
                    begin
                        state <= 2;
                    end
                end
                2: begin                            //counter complete
                    countDone <= 1'b1;
                    counter <= {N{1'b0}};
                    state <= 0;
                end
                default:;
            endcase
endmodule
//
// stage.v - One (simple) stage only.
//
// Created: December 22, 2009
// Author: Jon T. Butler
// Last Modified: September 3, 2010
// Modified by: Chris Johnson
//
// Notes: This does NOT put out a bent function.
//
// Called by: CountersMod.v
//
// Sub-module calls: test_for_bent.v
//
module stage #(parameter n=6, parameter N=2**n) //n is number of variables. N is # of bits in func's TT.
    (   input CLK,
        input [N-1:0] FNCS_TT_in,
        output reg REJECT,
        input to_next_stage_in,
        output pass,
        input [N-1:0] stage_TT_in,
        input [N-1:0] LIN_FNC,
        output reg [N-1:0] stage_TT_out,
        input [n+1:0] no_passes_in,
        output reg [n+1:0] no_passes_out,
        input countDone
    );

    wire passU1;
    reg valid;

    test_for_bent #(.n(n)) u1 (stage_TT_out, LIN_FNC, passU1);
    and stgs(pass,passU1,valid);    //output pass signal if input is valid and TT passes

    always@*    //Can prune this signal and just use to_next_stage_in
        if (to_next_stage_in==1)
            REJECT <= 1;
        else
            REJECT <= 0;

    always@ (posedge CLK)
        if (to_next_stage_in==1)    //Data to this stage comes from previous stage.
        begin
            stage_TT_out <= stage_TT_in;
            valid <= 1;
            no_passes_out <= no_passes_in + 1;
        end
        else                        //Data to this stage comes in from input buffer.
        begin
            stage_TT_out <= FNCS_TT_in;
            valid <= !countDone;    //valid iff counter is not yet done
            no_passes_out <= 0;
        end
endmodule


// n=                  2         4         6             8          10         12
// Freq.               181.8     144.8     [illegible]   53.4       42.9       3559
// #LUTs (%)           16 (0%)   67 (0%)   304(0%)       1251(1%)   5384(7%)   22179 (32%)
// Reg.Bits not i/o    4(0%)     23 (0%)   77(0%)        283(0%)    1037(1%)   4352 (6%)















































//
// stage1.v - One stage only.
//
// Created: December 22, 2009
// Author: Jon T. Butler
// Last Modified: September 3, 2010
// Modified by: Chris Johnson
//
// Notes: This does put out a bent function.
//
// Called by: CountersMod.v
//
// Sub-module calls: test_for_bent.v
//
module stage1 #(parameter n=6, parameter N=2**n) //n is number of variables. N is # of bits in func's TT.
    (   input CLK,
        input [N-1:0] FNCS_TT_in,
        output reg REJECT,
        input to_next_stage_in,
        output reg to_next_stage_out,
        input [N-1:0] stage_TT_in,
        input [N-1:0] LIN_FNC,
        output reg [N-1:0] stage_TT_out,
        input [n+1:0] no_passes_in,
        output reg [n+1:0] no_passes_out,
        input countDone,
        output reg [N-1:0] BENT,
        output reg valid_out
    );

    wire passU1;
    reg valid;

    test_for_bent #(.n(n)) u1 (stage_TT_out, LIN_FNC, passU1);
    and stgs1(pass,passU1,valid);   //output pass signal if input is valid and TT passes

    always@* to_next_stage_out <= (pass && (no_passes_out < N));

    always@*
        if (to_next_stage_in==1)
            REJECT <= 1;
        else
            REJECT <= 0;

    always@ (posedge CLK)
        if (no_passes_out >= N)
        begin
            BENT <= stage_TT_out;
            valid_out <= 1;
        end
        else
        begin
            BENT <= {N{1'b0}};
            valid_out <= 0;
        end

    always@ (posedge CLK)
        if (to_next_stage_in==1)    //Data to this stage came from previous stage.
        begin
            stage_TT_out <= stage_TT_in;
            no_passes_out <= no_passes_in + 1;
            valid <= 1;
        end
        else
        begin
            stage_TT_out <= FNCS_TT_in;
            no_passes_out <= 0;
            valid <= !countDone;    //valid iff counter is not done
        end
endmodule








//
// test_for_bent.v - Compares nonlinearity with the two possible bent weights for n.
//
// Created: December 22, 2009
// Author: Jon T. Butler
// Last Modified: September 3, 2010
// Modified by: Chris Johnson
//
// Notes: Nonlinearity is returned from Ones_Count.v
//
// Called by: stage.v
//            stage1.v
//
// Sub-module calls: Ones_Count.v
//
module test_for_bent #(parameter n=6, parameter N=2**n) //n is number of variables. N is # of bits in TT.
    (   input [N-1:0] TT_in,
        input [N-1:0] LIN_FNC,
        output reg pass
    );
    //
    // parameter n = 6;         // n number of variables
    // localparam N = 2**n;     // N number of bits in truth table of an n-variable function.
    //
    reg [N-1:0] Ham_dist;
    wire [n:0] Count;

    always @*
    begin
        Ham_dist = TT_in ^ LIN_FNC;
        if (Count == 2**(n-1) - 2**(n/2-1) || Count == 2**(n-1) + 2**(n/2-1))
            pass = 1;
        else
            pass = 0;
    end

    //
    Ones_Count u2 (Ham_dist, Count);
    defparam u2.n = n;
    //
endmodule
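
The test performed by test_for_bent.v, repeated once per stage, amounts to checking that the Hamming distance to every linear function equals one of the two bent weights $2^{n-1} \pm 2^{n/2-1}$. The Python sketch below is a software model of that sieve for even n, not the thesis code; the function names are assumptions.

```python
def linear_truth_table(y, n):
    """Truth table of x -> parity(x AND y); output for x stored at bit N-1-x."""
    N = 2 ** n
    tt = 0
    for x in range(N):
        tt |= (bin(x & y).count("1") & 1) << (N - 1 - x)
    return tt


def is_bent(tt, n, linear_tables):
    """True when tt is at a bent distance from every linear function."""
    low = 2 ** (n - 1) - 2 ** (n // 2 - 1)
    high = 2 ** (n - 1) + 2 ** (n // 2 - 1)
    for lin in linear_tables:
        distance = bin(tt ^ lin).count("1")   # Hamming distance via XOR
        if distance != low and distance != high:
            return False                      # rejected at this stage
    return True


def count_bent(n):
    """Exhaustively sieve all n-variable functions (practical for n = 2 or 4)."""
    N = 2 ** n
    linear_tables = [linear_truth_table(y, n) for y in range(N)]
    return sum(is_bent(tt, n, linear_tables) for tt in range(2 ** N))


print(count_bent(2))
```

Running count_bent for n = 2 and n = 4 gives counts that can be compared with the highest-nonlinearity entries in the table at the end of the Ones_Count.v listing.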


// n=           2        4         6             8         10
// Freq.        140.8    94.1      [illegible]   44.0      39.65
// #LUTs (%)    5 (0%)   46 (0%)   219(0%)       949(1%)   3421(3%)


module Ones_Count (TT, Count);
//
// Ones_Count.v - A program to count the number of 1's in HD (Hamming Distance), producing that
//                count at Count. This version of Ones_Count.v uses functions.
//
// Created: August 18, 2007
// Last Modified: December 26, 2009
// Author: Jon T. Butler
//
// Inputs: TT
// Outputs: Count
//
// Notes: 1. For n=2, this circuit builds a 4-input 3-output 1s count circuit that is intended to
//           make efficient use of the 4-input LUTs in the SRC's FPGA.
//
    parameter n = 10;       // At n=6, freq = 79.9 MHz. and it does not compile at n=7.
    localparam N = 2**n;
    output [n:0] Count;
    input [N-1:0] TT;
    reg [n:0] Count;    // If Count is wire, ModelSim complains of "illegal reference to net
                        // Count" below. I believe it is because Count should be declared a
                        // reg, per discussion on p. 178 of Palnitkar. Unfortunately, this
                        // is not a combinational logic circuit. Using 'task' does not seem
                        // to help. Both input and output variables must be reg.

    always @(TT)
    begin: CHECK_n
        case (n)
            2: Count <= Count2(TT);
            3: Count <= Count3(TT);
            4: Count <= Count4(TT);
            5: Count <= Count5(TT);
            6: Count <= Count6(TT);
            7: Count <= Count7(TT);
            8: Count <= Count8(TT);
            9: Count <= Count9(TT);
            10:Count <= Count10(TT);
            11:Count <= Count11(TT);
            12:Count <= Count12(TT);
        endcase
    end

    //****** The 1's count function - Count12 for 12-variable functions ******//
    function [12:0] Count12;
        input [4095:0] TT;
        begin: f12
            Count12 = Count11(TT[4095:2048]) + Count11(TT[2047:0]);
        end
    endfunction

    //****** The 1's count function - Count11 for 11-variable functions ******//
    function [11:0] Count11;
        input [2047:0] TT;
        begin: f11
            Count11 = Count10(TT[2047:1024]) + Count10(TT[1023:0]);
        end
    endfunction

    //****** The 1's count function - Count10 for 10-variable functions ******//
    function [10:0] Count10;
        input [1023:0] TT;
        begin: f10
            Count10 = Count9(TT[1023:512]) + Count9(TT[511:0]);
        end
    endfunction

    //****** The 1's count function - Count9 for 9-variable functions ******//
    function [9:0] Count9;
        input [511:0] TT;
        begin: f9
            Count9 = Count8(TT[511:256]) + Count8(TT[255:0]);
        end
    endfunction

    //****** The 1's count function - Count8 for 8-variable functions ******//
    function [8:0] Count8;
        input [255:0] TT;
        begin: f8
            Count8 = Count7(TT[255:128]) + Count7(TT[127:0]);
        end
    endfunction

    //****** The 1's count function - Count7 for 7-variable functions ******//
    function [7:0] Count7;
        input [127:0] TT;
        begin: f7
            Count7 = Count6(TT[127:64]) + Count6(TT[63:0]);
        end
    endfunction

    //****** The 1's count function - Count6 for 6-variable functions ******//
    function [6:0] Count6;
        input [63:0] TT;
        begin: f6
            Count6 = Count5(TT[63:32]) + Count5(TT[31:0]);
        end
    endfunction

    //****** The 1's count function - Count5 for 5-variable functions ******//
    function [5:0] Count5;
        input [31:0] TT;
        begin: f5
            Count5 = Count4(TT[31:16]) + Count4(TT[15:0]);
        end
    endfunction

    //****** The 1's count function - Count4 for 4-variable functions ******//
    function [4:0] Count4;
        input [15:0] TT;
        begin: f4
            Count4 = Count3(TT[15:8]) + Count3(TT[7:0]);
        end
    endfunction

    //****** The 1's count function - Count3 for 3-variable functions ******//
    function [3:0] Count3;
        input [7:0] TT;
        begin: f3
            Count3 = Count2(TT[7:4]) + Count2(TT[3:0]);
        end
    endfunction

    //****** The 1's count function - Count2 for 2-variable functions ******//
    function [2:0] Count2;
        input [3:0] TT;
        begin: f2
            Count2[0]=TT[3]^TT[2]^TT[1]^TT[0];
            Count2[1]=(TT[3]&TT[2] | TT[3]&TT[1] | TT[3]&TT[0] | TT[2]&TT[1] | TT[2]&TT[0] | TT[1]&TT[0])
                      & ~(TT[3]&TT[2]&TT[1]&TT[0]);
            Count2[2]=TT[3]&TT[2]&TT[1]&TT[0];
        end
    endfunction

    // n=           2        4         6             8         10
    // Freq.        149.9    96.7      [illegible]   47.6      38.7
    // #LUTs (%)    3 (0%)   32 (0%)   71 (0%)       595(0%)   2296(3%)

endmodule
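
The recursive structure of Ones_Count.v, in which each CountK splits its input in half and adds two Count(K-1) results down to a hand-written 4-input base case, can be modeled in software. The Python sketch below is an illustrative check of the Count2 Boolean equations and of the recursion; it is not thesis code and the function names are assumptions.

```python
def count2(tt):
    """Ones count of a 4-bit value using the same Boolean equations as Count2."""
    b = [(tt >> i) & 1 for i in range(4)]
    all_four = b[3] & b[2] & b[1] & b[0]
    any_pair = ((b[3] & b[2]) | (b[3] & b[1]) | (b[3] & b[0])
                | (b[2] & b[1]) | (b[2] & b[0]) | (b[1] & b[0]))
    bit0 = b[3] ^ b[2] ^ b[1] ^ b[0]      # parity: count is odd
    bit1 = any_pair & (1 - all_four)      # two or three ones
    bit2 = all_four                       # exactly four ones
    return (bit2 << 2) | (bit1 << 1) | bit0


def ones_count(tt, n):
    """Ones count of a 2**n bit truth table by recursive halving (n >= 2)."""
    if n == 2:
        return count2(tt)
    half = 2 ** (n - 1)
    upper = tt >> half
    lower = tt & ((1 << half) - 1)
    return ones_count(upper, n - 1) + ones_count(lower, n - 1)


print([count2(v) for v in range(16)])
```

Comparing the model against a direct population count over all 4-bit inputs, and over all 16-bit inputs for n = 4, confirms the equations as transcribed.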





















//  n=      2        3          4              5                    6
//
//nonlinearity - over all functions/rot. sym. func./symmetric func.
//
//  0     8/4/4    16/4/4      32/ 4/ 4          64/ 4/ 4     ?/   4/  4
//  1     8/4/4   128/8/8     512/ 8/ 8        2048/ 8/ 8     ?/   8/  8
//  2     0/0/0   112/4/4    3840/ 8/ 4       31744/ 4/ 4     ?/   8/  4
//  3     0/0/0     0/0/0   17920/ 8/ 0      317440/ 0/ 0     ?/  16/  0
//  4     0/0/0     0/0/0   28000/12/ 4     2301440/ 0/ 0     ?/  20/  0
//  5     0/0/0     0/0/0   14336/16/ 8    12888064/24/ 8     ?/  16/  0
//  6     0/0/0     0/0/0     896/ 8/ 4    57996288/48/16     ?/  56/  8
//  7     0/0/0     0/0/0       0/ 0/ 0   215414784/24/ 8     ?/  88/ 16
//  8     0/0/0     0/0/0       0/ 0/ 0   647666880/ 0/ 0     ?/  80/  8
//  9     0/0/0     0/0/0       0/ 0/ 0  1362452480/ 0/ 0     ?/ 152/  0
// 10     0/0/0     0/0/0       0/ 0/ 0  1412100096/36/ 4     ?/ 184/  0
// 11     0/0/0     0/0/0       0/ 0/ 0   556408832/72/ 8     ?/ 144/  0
// 12     0/0/0     0/0/0       0/ 0/ 0    27387136/36/ 4     ?/ 324/  4
// 13     0/0/0     0/0/0       0/ 0/ 0           0/ 0/ 0     ?/ 432/  8
// 14     0/0/0     0/0/0       0/ 0/ 0           0/ 0/ 0     ?/ 360/  4
// 15     0/0/0     0/0/0       0/ 0/ 0           0/ 0/ 0     ?/ 648/  8
// 16     0/0/0     0/0/0       0/ 0/ 0           0/ 0/ 0     ?/ 832/  8
// 17     0/0/0     0/0/0       0/ 0/ 0           0/ 0/ 0     ?/ 768/  0
// 18     0/0/0     0/0/0       0/ 0/ 0           0/ 0/ 0     ?/1076/  0
// 19     0/0/0     0/0/0       0/ 0/ 0           0/ 0/ 0     ?/1304/  0
// 20     0/0/0     0/0/0       0/ 0/ 0           0/ 0/ 0     ?/1232/  0
// 21     0/0/0     0/0/0       0/ 0/ 0           0/ 0/ 0     ?/1536/ 16
// 22     0/0/0     0/0/0       0/ 0/ 0           0/ 0/ 0     ?/1924/ 16
// 23     0/0/0     0/0/0       0/ 0/ 0           0/ 0/ 0     ?/2232/  0
// 24     0/0/0     0/0/0       0/ 0/ 0           0/ 0/ 0     ?/1612/  0
// 25     0/0/0     0/0/0       0/ 0/ 0           0/ 0/ 0     ?/ 752/  0
// 26     0/0/0     0/0/0       0/ 0/ 0           0/ 0/ 0     ?/ 432/  4
// 27     0/0/0     0/0/0       0/ 0/ 0           0/ 0/ 0     ?/  96/  8
// 28     0/0/0     0/0/0       0/ 0/ 0           0/ 0/ 0     ?/  48/  4
// 29     0/0/0     0/0/0       0/ 0/ 0           0/ 0/ 0     ?/   0/  0
// 30     0/0/0     0/0/0       0/ 0/ 0           0/ 0/ 0     ?/   0/  0
//
// Values for ALL functions for n = 6 were not obtained, since this computation
// takes more than 5000 years at 100 MHz.
// Values for ROT. SYM. functions for n = 7 were not obtained because, after
// 15 hours of compilation time, Synplify Pro issued an "Out-of-Memory"
// error message.
// Values for SYMMETRIC functions for n = 7 were not obtained because, after
// 15 hours of compilation time, Synplify Pro issued an "Out-of-Memory"
// error message.

2. Circular Pipeline With Reservoir


Modules identical to those in the circular pipeline with IFGs (code in section 1) are not replicated in this section. 












































//
// MY_CIRC_Pipe.v - An interface between the circular pipeline w/reservoir code that sets up
//                  streaming with the SRC-6. Based on the SRC example user_one_stream.
//
// Created: August 20, 2010
// Last Modified: September 3, 2010
// Author: Chris Johnson
//
// Notes: DATA_OUT bus width is not parameterized; must be manually edited for n>5.
//        modDATA_OUT must be edited for n>6.
//
// Sub-module calls: CircPipe.v
//
module MY_STREAM_TEST (
    CNT,
    START,
    CLK,
    CLR,
    DONE,
    DATA_OUT,
    VALID_OUT,
    STALL_IN,
    TERM_OUT
    );
    input [31:0] CNT;
    input START;
    input CLK /* synthesis syn_noclockbuf=1 syn_maxfan=100000 */;
    input CLR;
    output DONE;
    output [31:0] DATA_OUT;
    output VALID_OUT;
    input STALL_IN;
    output TERM_OUT;
    // output [N-n-1:0] COUNTER;

    reg [31:0] DATA_OUT;
    reg VALID_OUT;
    reg TERM_OUT;
    reg DONE;
    reg [1:0] state;

    parameter IDLE = 0;
    parameter ACTIVE = 1;
    parameter STALLED = 2;
    parameter FINISHING = 3;

    wire modDONE;
    wire [63:0] modDATA_OUT;
    wire modVALID_OUT;
    wire modTERM_OUT;

    always @*// (posedge CLK or posedge CLR)
        if (CLR) begin
            DATA_OUT <= 0;
            DONE <= 0;
            VALID_OUT <= 0;
            TERM_OUT <= 0;
            state <= 0;
            //COUNTER <= 0;
        end
        else
            case (state)
                IDLE: if (START) begin
                    DATA_OUT <= 0;
                    VALID_OUT <= 1;
                    // COUTN
                    state <= ACTIVE;
                end
                ACTIVE: begin
                    DATA_OUT <= modDATA_OUT;
                    DONE <= modDONE;
                    VALID_OUT <= modVALID_OUT;
                    TERM_OUT <= modDONE;
                    if (STALL_IN) begin
                        VALID_OUT <= 0;
                        state <= STALLED;
                    end
                end
                STALLED: if (~STALL_IN) begin
                    VALID_OUT <= 1;
                    state <= ACTIVE;
                end
                FINISHING: begin
                    //DONE <= 0;
                    state <= IDLE;
                end
                default:;
            endcase

    CircPipe u2 (START, CLK, CLR, modDONE, modDATA_OUT, modVALID_OUT, STALL_IN, modTERM_OUT);

endmodule





// 


// CircPipe.v - The circular pipeline with independent function generators top level module. 


// 
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// Created: December 22, 2009 
// Author: Jon T. Butler 
// Last Modified: September 3, 2010 
// Modified by: Chris Johnson 
// 
// Notes: Set parameter ‘n’ in this file. It is passed to all sub-modules. 
// 
// Called by: MY_CIRC_PIPE.v 
// 
// Sub-module calls: countersMod.v 
// Stage_TT.v 
// 
// This implements the circular pipeline. For n-variable functions, there are N = 2**n stages,
// one for each linear function. We need only compare against the linear functions, since a
// function that has a bent distance from all linear functions also has a bent distance from
// all affine functions. In this realization, only one stage has a bent function output, to
// simplify the circuit. In this way, the circular pipeline serves as a buffer. In this
// case a bent function will go through from N to 2N-1 stages.
// 
// 
// 
module CircPipe #(parameter n=5, parameter N=2**n) 
( input START, 

input CLK, 

input CLR, 

output done, 

output [63:0] BENT, 

output valid_out, 

input STALL_IN, 

output reg term_out 

);

wire [N-1:0] LIN_FNC [N-1:0];
wire [N-1:0] REJECT; // 0 bit indicates FNCS word not accepted.
reg temp;
wire INHIBIT; // signal from the queue to pause counters
wire [N-1:0] FNCS [N-1:0]; // Each of the N words in counter FNCS has N bits.
wire [N*N-1:0] FNCS_1d; //for connection to queue module
wire [N*N-1:0] QUEUE; //output of reservoir queue
wire [n-1:0] FNCShob [N-1:0]; //high order bits for the counter
wire [N-n-1:0] counter; //simple counter
wire [N-1:0] to_stage;
wire [N-1:0] stage_TT [N-1:0];
wire [n+1:0] no_passes [N-1:0];

genvar g;

////////////////////////////////////////////////////////////////////////
////////// FUNCTION GENERATOR //////////











//instantiate a single counter
countersMod #(.n(n)) u4 (START, CLK, CLR, STALL_IN, INHIBIT, counter, done);
//extra bit to signal counter is done

generate
//Generate high order bits
for (g=0; g<N; g=g+1)
begin: CounterHOB
    assign FNCShob[g] = g;
end
endgenerate
//Generate function generators
generate
for (g=0; g<N; g=g+1)
begin: CounterConcat
    assign FNCS[g] = {FNCShob[g],counter[N-n-1:0]};
end
endgenerate
//Create 1-d version of function generators for i/o interface
generate
for (g=0; g<N; g=g+1)
begin: FNCS1d
    assign FNCS_1d[g*N+N-1:g*N] = FNCS[g];
end
endgenerate
////////// FUNCTION GENERATOR //////////











////////// LINEAR FUNCTIONS //////////
generate
for (g=0; g<N; g=g+1)
begin: LinearGen
    assign LIN_FNC[g] = Linear(g);
end
endgenerate

function [N-1:0] Linear(input [n-1:0] Y);
integer j;
integer k;
reg [n-1:0] X;
begin
    for (j=0; j<N; j=j+1)
    begin
        X = j;
        temp = 0;
        for (k=0; k<n; k=k+1)
        begin
            temp = temp ^ (X[k] & Y[k]);
        end
        Linear[N-1-X] = temp;
    end
end
endfunction
////////// LINEAR FUNCTIONS //////////








////////// RESERVOIR/QUEUE //////////
CircPipeQue #(.n(n)) QueModule (CLK, FNCS_1d, REJECT, INHIBIT, QUEUE);
////////// RESERVOIR/QUEUE //////////















































////////// STAGES //////////
generate
for (g=0; g<N; g=g+1)
begin: Stages
    if (g != 0) begin
        stag #(.n(n)) u2 (CLK, QUEUE[g*N+N-1:g*N], /*VALID_IN[g],*/ REJECT[g], to_stage[g-1],
            to_stage[g], stage_TT[g-1], LIN_FNC[g], stage_TT[g], no_passes[g-1], no_passes[g]);
    end
    if (g == 0) begin
        stage1 #(.n(n)) u3 (CLK, QUEUE[N-1:0], /*VALID_IN[0],*/ REJECT[0], to_stage[N-1],
            to_stage[0], stage_TT[N-1], LIN_FNC[0], stage_TT[0], no_passes[N-1], no_passes[0], BENT, valid_out);
    end
end
endgenerate


endmodule 
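
The header comment of CircPipe.v states that a candidate only needs to be compared against the N = 2**n linear functions. The following Python sketch illustrates that test in software, assuming n is even and using the standard bent distances 2**(n-1) ± 2**(n/2-1). The names `linear_tt` and `is_bent` are hypothetical, and the truth tables are plain lists indexed by the input x rather than the bit ordering `Linear[N-1-X]` used in the Verilog function.

```python
def linear_tt(y, n):
    """Truth table (list indexed by x) of the linear function x -> parity(x AND y)."""
    return [bin(x & y).count("1") & 1 for x in range(2 ** n)]


def is_bent(tt, n):
    """True if tt is at a bent distance from every linear function (n assumed even)."""
    half = 2 ** (n - 1)
    offset = 2 ** (n // 2 - 1)
    allowed = {half - offset, half + offset}
    for y in range(2 ** n):
        lin = linear_tt(y, n)
        # Hamming distance between the candidate and this linear function
        dist = sum(a != b for a, b in zip(tt, lin))
        if dist not in allowed:
            return False
    return True


n = 4
# x0*x1 XOR x2*x3, a well-known bent function on four variables
candidate = [((x & 1) & ((x >> 1) & 1)) ^ (((x >> 2) & 1) & ((x >> 3) & 1))
             for x in range(2 ** n)]
print(is_bent(candidate, n))
```

A distance d from a linear function implies a distance 2**n - d from its complement, which is why checking only the linear functions also covers the affine ones.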








// CircPipeQue.v - Reservoir and queue for circular pipeline.








// Created: March 30, 2010 
// Author: Chris Johnson 
// Last Modified: September 3, 2010 
// 
// Notes: None 
// 
// Called by: CircPipe.v 
// 
// Sub-module calls: pri_enc.v 
// thermo_adder.v 
// 
// 
// 
module CircPipeQue #(parameter n=3, parameter N=2**n)
( input CLK,
  input [N*N-1:0] gen_1,
  input [N-1:0] reject,
  output inFromRes, //stall function generator
  output reg [N*N-1:0] queue
);
localparam SHAMT_WIDTH = n+1; //number of bits for shamt. n is enough to hold the max transfer distance
wire [N-1:0] inToPipe;
reg [N-1:0] in [N-1:0]; //Output of MUX that selects candidates for pipeline
wire [N*N-1:0] in_1; // 1-d version of in
reg [N-1:0] res [2*N-2:0]; //extra reg for pipelining
wire [N*(2*N-1)-1:0] reswire;
reg [SHAMT_WIDTH:0] shamt [N:0]; //shift amount using to route TT's into "res"
wire [3*2**(2*n-1)-2**(n-1)-2:0] shamt_sel; //translate shamt into sel lines for use in pri_enc
//vector width is equivalent to sum(2**n, 2**(n+1)-1)
wire [N-1:0] out [2*N-2:0];
reg [N-1:0] gen [N-1:0]; //2-D version of Func Gen inputs
reg [2*N-2:0] occ; //occupied marker bits, one for each reservoir and "in" function
wire [n-1:0] thermoSum;
reg [N-2:0] thermo_occ; //occupied bits routed to thermoSum (either middle or lower 3 occ bits)

genvar i, j;


//Transform Func Gen's TT's to 2-D arrays 








generate
for (i=0; i<N; i=i+1)
begin: multidim
    always@*// (posedge CLK) //Pipeline function generator
    begin
        gen[i] <= gen_1[N*i+N-1:N*i];
    end
end
endgenerate
always@* //MUX to select which source of functions to provide to CircPipe
    if (inFromRes)
        queue <= reswire[N*N-1:0];
    else
        queue <= gen_1;
//to output to the testbench
generate
for (i=0; i<2*N-1; i=i+1)
begin: ReswireOutput
    assign reswire[i*N+N-1:N*i] = res[i];
end
endgenerate


/************************************************************************/
//Create select lines from shamt
generate
for (i=0; i<N; i=i+1)
begin: in1D
    assign in_1[i*N+N-1:i*N] = in[i];
end
endgenerate

generate
for (i=0; i<2*N-2; i=i+1)
begin: shamt_sel_gen
    if (i<N)
    begin
        for (j=0; j<N; j=j+1)
        begin: sham_sel_gen_inner1
            assign shamt_sel[i*N+j] = (shamt[N-j]==i && !inToPipe[j]) ? 1'b1 : 1'b0;
        end
    end
    else// if (i<2*N-2)
    begin
        for (j=0; j<2*N-1-i; j=j+1)
        begin: shamt_sel_gen_inner2
            assign shamt_sel[shamt_idx(i)+j] = (shamt[2*N-i-1-j]==i && !inToPipe[j+i-N+1]) ? 1'b1 : 1'b0;
        end
    end
end
endgenerate
/************************************************************************/
//SECTION ONE: INPUT CONTROL AND RESERVOIR
/************************************************************************/


assign inFromRes = occ[N-1]; 




















//Select input from either func gen or reservoir 
generate 
for (i=0; i<N; i=i+1) 
begin: incoming 
always@*//(inFromRes, res[i], gen[i]) 





begin: A 
in[i] <= inFromRes ? res[i] : gen[i];
end 
end 
endgenerate 


//Calculate shamt from reservoir 

always@* thermo_occ <= inFromRes ? occ[2*N-2:N] : occ[N-2:0]; 
thermo_adder #(n) thermo (thermo_occ,thermoSum) ; 

always@* shamt[N] <= thermoSum; 





//Calculate shamt for each incoming function T from the MUX 
generate 
for (i=0; i<N; i=i+1) 
begin:shiftCalc 
always@*//(shamt[i+1], inToPipe[N-1-i]) 
begin: shamt_setup 
shamt[i] = shamt[i+1] + !inToPipe[N-1-i];
end 
end 
endgenerate 








//set occ bits based on res contents
generate
for (i=0; i<2*N-1; i=i+1)
begin: occ_connect
    always@*
        if(res[i]) occ[i] <= 1'b1;
        else occ[i] <= 1'b0;
end
endgenerate


//Assign to resTemp (wires to the reservoir registers) the proper input, based on xfer table & inToPipe
// Accomplished through use of priority encoders
generate
for (i=0; i<2*N-2; i=i+1)
begin: Cases
    if (i<N)
    begin
        pri_enc #(.n(n),.s(N)) pi_1 (in_1, shamt_sel[i*N+N-2:i*N], inToPipe[N-1:0], out[i]);
        //for i >= N, pri_enc doesn't need entire 'in_1', so pruning will occur, shamt's are each 5 bits
    end
    else// (i<2*N-2)
    begin
        pri_enc #(.n(n),.s(2*N-1-i)) pi_2 (in_1[N*N-1:(i-N+1)*N], shamt_sel[shamt_idx(i+1)-1:shamt_idx(i)],
            inToPipe[N-1:i+1-N], out[i]); //paring should occur
    end
end
endgenerate
//Constant function to generate indices of shamt_sel in the generate else-if (i<2*N-2) section of pri_enc calls


function integer shamt_idx(input integer index); 
integer k; 
integer j; 
integer test; //added for XST 
begin 
k=1; 
shamt_idx=N*N; 
for (j=index; N<j; j=j-1) 
begin 
shamt_idx = shamt_idx + N - k; 
k=k+1; 
end
end 
endfunction 
/*
generate
for (i=0; i<2*N-1; i=i+1)
begin: Pipelres
    always@ (posedge CLK)
        res[i] <= res_Op[i];
end
endgenerate
*/


generate
for (i=0; i<2*N-1; i=i+1)
begin: reservoir
    if (i<N-1) begin
        always@ (posedge CLK) res[i]/*res_Op[i]*/ = low_res (inFromRes, shamt_sel[i*N+N-1:i*N], shamt[i], out[i], res[N+i], res[i]);
    end
    else if (i==N-1) begin
        always@ (posedge CLK) res[i]/*res_Op[i]*/ = low_res (inFromRes, shamt_sel[i*N+N-1:i*N], shamt[i], out[i], {N{1'b0}}, res[i]);
    end
    else if (i<2*N-2) begin //(N-1 < i < 2*N-2)
        always@ (posedge CLK) res[i]/*res_Op[i]*/ = mid_res (inFromRes, {i-N+1{shamt_sel[shamt_idx(i+1)-1:shamt_idx(i)]}}, out[i], res[i]);
    end
    else begin //i==2*N-2
        always@ (posedge CLK) res[i]/*res_Op[2*N-2]*/ = if_func_Nbit (in[N-1], inToPipe[N-1], inFromRes, shamt[1], res[2*N-2]);
        //probably don't need this, just control occ bit and always assign in[N-1] to reswire[N-1]
    end
end
endgenerate

function [N-1:0] low_res
(
    input inFromRes,
    input [N-1:0] sel,
    input [N-1:0] shamt_i, //may not be needed if out is already zeros
    input [N-1:0] out,
    input [N-1:0] mid_res,
    input [N-1:0] res
);
begin
    if(inFromRes && mid_res) begin //slide middle registers down
        low_res = mid_res;
    end
    else if(sel && out) begin //if sel and out wire are not zero
        low_res = out;
    end
    else low_res = inFromRes ? {N{1'b0}} : res;
end
endfunction

function [N-1:0] mid_res ( input inFromRes,
    input [N-1:0] sel, //couldn't figure out how to taper this width
    //input [N-1:0] shamt_i,
    input [N-1:0] out,
    input [N-1:0] res);
begin
    if(sel && out) begin //if sel and out wire are not zero
        mid_res = out;
    end
    else mid_res = {N{1'b0}};
end
endfunction

//This is a NOT-IF
function [N-1:0] if_func_Nbit (
    input [N-1:0] in,
    input inToPipe,
    input inFromRes,
    input [SHAMT_WIDTH:0] shamt_i,
    input prior_value
);
begin
    if ((3>=shamt_i) && inFromRes)
    begin
        if_func_Nbit = {N{1'b0}};
    end
    else if ((shamt_i==2*N-2) && !inToPipe)
    begin
        if_func_Nbit = in;
    end
    else
        if_func_Nbit = prior_value;
end
endfunction
endmodule 
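
The shift amounts in CircPipeQue are built as a running count: `shamt[N]` starts from the thermometer sum of occupied reservoir slots, and each lower entry adds one when the corresponding candidate was not taken into the pipeline. The Python sketch below mirrors that recurrence. The function name `shift_amounts` and the list-based representation are assumptions made for illustration, and the reading of `inToPipe` as "candidate accepted by the pipeline" follows the comments in the listing rather than a verified simulation.

```python
def shift_amounts(in_to_pipe, occupied):
    """Mirror shamt[i] = shamt[i+1] + !inToPipe[N-1-i], with shamt[N] = occupied."""
    n_slots = len(in_to_pipe)
    shamt = [0] * (n_slots + 1)
    shamt[n_slots] = occupied
    for i in range(n_slots - 1, -1, -1):
        # add one for every candidate that was not accepted into the pipeline
        rejected = 0 if in_to_pipe[n_slots - 1 - i] else 1
        shamt[i] = shamt[i + 1] + rejected
    return shamt


# Four candidates, two already-occupied reservoir slots;
# candidates 1 and 2 were not accepted and must be stored.
print(shift_amounts([1, 0, 0, 1], occupied=2))
```

Each rejected candidate therefore receives a distinct, increasing reservoir position above the slots that are already occupied.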
module pri_enc #(parameter n=2,s=4) (in, sel, inToPipe, out); 
// 
// pri_enc - Verilog code to implement a priority encoder depending on a parameters, n and m. 
// 
// 
// Created: March 15, 2010 
// Last Modified: July 21, 2010 
// Author: Chris Johnson 
// Adapted from J.T. Butler's 1-bit priority encoder, modified for 
// for busses and select lines in the Circular Pipeline Reservoir. 
// 
// Notes: None. 
// 
// Called by: CircPipeQue.v 
//
// Sub-module calls: sel_module.v 
// iff.v 
// 
// 
parameter N = 2**n;
//s is number of TT's being input (all MUX's get one except the last one generated gets 2)

localparam SHAMT_WIDTH = n+1; //number of bits for shamt. n is large enough to hold the max transfer distance

input [s*N-1:0] in; // in has up to N*N bits; all the applicable incoming functions
input [s-2:0] sel; // sel determines which OUT. Up to N-1 bits.
input [s-1:0] inToPipe; // signal indicating slot in circ pipe is vacant
output [N-1:0] out; // OUT is main output of circuit.
wire [s*N-1:0] inC;
wire [(INNER_S(s)-3)*N+N-1:0] inner; // inner is a line interconnecting








genvar i; 


//Constant function to provide INNER_S index 











function integer INNER_S(input integer s); 
begin 
if (s>2) 
INNER_S = s; 
else 
INNER_S = 3; 
end 
endfunction 


//Bring TT in if it's rejected from the circular pipeline, else don't bring it in. 
generate 
for (i=0; i<s; i=i+1) 
begin: ifinToPipe 
iff #(.N(N)) u5 (in[i*N+N-1:i*N],inToPipe[i],inC[i*N+N-1:i*N]); 














end 
endgenerate 
// Within the generate for loop below, if statements handle (3) special interconnection
// requirements, beginning, end, and middle.
generate
for (i=0; i<s-1; i=i+1)
begin: stage
    assign inner[N-1:0] = inC[s*N-1:s*N-N];
    if (i == 0)
        sel_module #(.N(N)) u1 (inner[N-1:0], inC[N-1:0],
            sel[i], out);
    else if (i == (s-2))
        sel_module #(.N(N)) u2 (inC[s*N-1:s*N-N], inC[s*N-N-1:s*N-2*N], sel[i],
            inner[(i-1)*N+N-1:(i-1)*N]); //in case of s=2, input 2 (inC) is repeated from MUX_0
    else
        sel_module #(.N(N)) u3 (inner[i*N+N-1:i*N], inC[i*N+N-1:i*N], sel[i],
            inner[(i-1)*N+N-1:(i-1)*N]);
end
endgenerate
endmodule 
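
The pri_enc module combines a masking step (`iff`) with a chain of two-input selectors (`sel_module`). The Python sketch below is one reading of that structure: the lowest-index asserted select line wins, and the last candidate is the default. The function names `iff_mask` and `pri_select` are hypothetical, truth tables are represented as integers, and the sketch assumes each selector in the chain is driven by `sel[i]`, which is only partly legible in the listing.

```python
def iff_mask(tt, in_to_pipe):
    """Model of iff: pass the truth table unless it went into the pipeline."""
    return 0 if in_to_pipe else tt


def pri_select(candidates, in_to_pipe, sel):
    """Chain of 2:1 selectors; lowest-index asserted select line wins."""
    if len(sel) != len(candidates) - 1:
        raise ValueError("expected one select line fewer than candidates")
    masked = [iff_mask(tt, taken) for tt, taken in zip(candidates, in_to_pipe)]
    for i, s in enumerate(sel):
        if s:
            return masked[i]
    return masked[-1]  # default input of the last selector in the chain


tts = [0b1010, 0b1100, 0b1111]
print(pri_select(tts, [0, 0, 0], [0, 1]))
```

A candidate that was accepted by the pipeline is masked to zero before selection, so it can never be written into the reservoir a second time.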
//
// sel_module - Selector module. Basically, a MUX.
//
// Created: March 30, 2010
// Last Modified: July 21, 2010
// Author: Chris Johnson
//
// Notes: None.
//
// Called by: pri_enc.v
//
// Sub-module calls: None.
//
module sel_module #(parameter N=4) (sel_0, sel_1, sel, out);
input [N-1:0] sel_0;
input [N-1:0] sel_1;
input sel;
output [N-1:0] out;
reg [N-1:0] out;

always @*
begin
    if (sel == 1) out <= sel_1;
    else out <= sel_0;
end
endmodule





//
// iff - Simply an if statement, used for calls within a generate statement
//
// Created: March 30, 2010
// Last Modified: July 21, 2010
// Author: Chris Johnson
//
// Notes: None.
//
// Called by: pri_enc.v
//
// Sub-module calls: None.
//
module iff #(parameter N=4) (in, inToPipe, out);
input [N-1:0] in;
input inToPipe;
output [N-1:0] out;
reg [N-1:0] out;

always@*
begin
    if (!inToPipe)
        out <= in;
    else
        out <= {N{1'b0}};
end
endmodule

module thermo_adder #(parameter n = 2) (occup, sum);





//
// thermo_adder - Verilog code to compute the sum of a 2^n-1 bit input, occup.
// occup is the set of bits from the stages in the reservoir
// that indicate whether the stage is occupied (1) or not (0).
// The bits from occup form a thermometer. So, if occup(i) = 1,
// then occup(j) = 1 for all j < i. This results in a simpler
// circuit.
//
// Created: January 31, 2010
// Last Modified: 21 July 2010
// Author: Jon T. Butler
// Modified: Chris Johnson
//
// Called by: CircPipeQue.v
//
// Sub-module calls: None.
//
localparam N = 2**n;

input [N-2:0] occup; // occup has 2^n-1 bits.
output reg [n-1:0] sum; // sum is an n-bit number indicating how many input bits are 1.
integer index, g;

always @*
    if (occup[N-2] == 1'b1)
        sum[n-1:0] = {{n{1'b1}}};
    else
    begin
        sum[n-1] = 1'b0;
        index = 2**(n-1)-1;
        for (g=n-1; g>=0; g=g-1)
        begin
            if (occup[index] == 1'b1)
            begin
                sum[g] = 1;
                index = index + 2**(g-1);
            end
            else
            begin
                sum[g] = 0;
                index = index - 2**(g-1);
            end
        end
    end
endmodule 
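
Because the occupancy bits form a thermometer code, the number of ones can be found by a binary search rather than a full adder tree, which is the idea behind thermo_adder. The Python sketch below shows the same search in software. The function name `thermo_sum` and the list representation of the occupancy bits are illustrative assumptions, and the loop is written independently of the Verilog indexing.

```python
def thermo_sum(occup, n):
    """Count the ones in a thermometer code of 2**n - 1 bits using n probes."""
    if len(occup) != 2 ** n - 1:
        raise ValueError("expected 2**n - 1 occupancy bits")
    total = 0
    for g in range(n - 1, -1, -1):
        probe = total + (1 << g)  # is the count at least this large?
        if occup[probe - 1]:
            total = probe
    return total


n = 3
for ones in range(2 ** n):
    occup = [1] * ones + [0] * (2 ** n - 1 - ones)
    print(ones, thermo_sum(occup, n))
```

Each iteration fixes one bit of the result, so an input of 2**n - 1 bits needs only n probes.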
B. SRC-6 IMPLEMENTATION FILES


1. main.c 


/************************************************************************/
/*                                                                      */
/* main.c - C program to test an SRC-6E implementation of min.v         */
/*                                                                      */
/* Author: Chris Johnson                                                */
/* Created: August 1, 2010                                              */
/* Last modified: September 3, 2010                                     */
/*                                                                      */
/* Description: This program searches for bent functions using the      */
/* circular pipeline with IFGs                                          */
/*                                                                      */
/************************************************************************/


#include <map.h>
#include <stdlib.h>
#include <string.h>

void subr (int64_t*, int64_t*, int64_t*, int64_t*, int64_t*, int64_t*, int64_t*, int8_t*, int64_t*, int);

int main () {
int i, j, mapnum=0;
int64_t time_clk, r1, r2, cmin[32], invalc;
int64_t *in0, *in1, *in2, *in3, *BENT, *REJECT, *STAGE_TT_out;
int8_t *valid_out;

/* Allocate array of x values, in, and array of function values, out */
in0 = (int64_t *) malloc (4096* sizeof (int64_t));
in1 = (int64_t *) malloc (4096* sizeof (int64_t));
in2 = (int64_t *) malloc (4096* sizeof (int64_t));
in3 = (int64_t *) malloc (4096* sizeof (int64_t));
BENT = (int64_t *) malloc (4096* sizeof (int64_t));
STAGE_TT_out = (int64_t *) malloc (4096* sizeof (int64_t));




















for (i = 0; i < 4096; i++) {
in0[i] = 12816; //3210
in1[i] = 30292; //7654
in2[i] = 47768; //BA98
in3[i] = 65244; //FEDC
[i] =; 





map_allocate (1);

/* Call subroutine subr.mc on the MAP. */
subr (in0, in1, in2, in3, &time_clk, REJECT, BENT, valid_out, STAGE_TT_out, mapnum);

/* Print out the number of clocks. */
printf ("%lld clocks\n", time_clk);

/* Print out the output. */
for (i=0; i<4096; i++) {
printf ("BENT: %x \n",BENT[i]);
if (out [i])
printf ("PartialStageTT: %x \n",out[i]);


2. subr.mc 


/************************************************************************/
/*                                                                      */
/* subr.mc - MAP C subroutine to queue TT's for circular pipeline.      */
/*                                                                      */
/* Author: Chris Johnson                                                */
/* Created: June 14, 2010                                               */
/* Last modified: September 3, 2010                                     */
/*                                                                      */
/* Description: This program calls an SRC-6 macro that sieves           */
/* functions through a circular pipeline.                               */
/*                                                                      */
/************************************************************************/


#include <libmap.h> 


void subr (int64_t in0[], int64_t in1[], int64_t in2[], int64_t in3[], int64_t *time, int64_t reject[],
int64_t bent[], int8_t valid_out, int64_t tt[], int mapnum) {

// Declare one OBM banks in SRC-6 to store...
OBM_BANK_A (IN0, int64_t, 1024)
OBM_BANK_B (BENT_o, int64_t, 4096)
OBM_BANK_C (IN1, int64_t, 1024)
OBM_BANK_D (IN2, int64_t, 1024)
OBM_BANK_E (IN3, int64_t, 1024)
OBM_BANK_F (TT_o, int64_t, 4096)
int64_t my64bit_in0, my64bit_in1, my64bit_in2, my64bit_in3, REJECT, BENT, stage_TT_out, t0, t1;
int8_t VALID_OUT; //only need 1 bit
int i;

// Get values by DMAing FROM the CPU
DMA_CPU (CM2OBM, IN0, MAP_OBM_stripe(1,"A"), in0, 1, 1024*sizeof(int64_t), 0);
DMA_CPU (CM2OBM, IN1, MAP_OBM_stripe(1,"C"), in1, 1, 1024*sizeof(int64_t), 0);
DMA_CPU (CM2OBM, IN2, MAP_OBM_stripe(1,"D"), in2, 1, 1024*sizeof(int64_t), 0);
DMA_CPU (CM2OBM, IN3, MAP_OBM_stripe(1,"E"), in3, 1, 1024*sizeof(int64_t), 0);
wait_DMA (0);

read_timer (&t0);

for (i = 0; i < 1024; i++) {
// The my_operator macro call takes the four 64-bit input words and returns
// REJECT, BENT, VALID_OUT, and stage_TT_out
my64bit_in0 = IN0[i];
my64bit_in1 = IN1[i];
my64bit_in2 = IN2[i];
my64bit_in3 = IN3[i];
my_operator (my64bit_in0, my64bit_in1, my64bit_in2, my64bit_in3, REJECT, BENT, VALID_OUT,
stage_TT_out);
BENT_o[i] = BENT;
TT_o[i] = stage_TT_out;
}
read_timer (&t1);
*time = (t1 - t0);
// Return values by DMAing TO the CPU
DMA_CPU (OBM2CM, BENT_o, MAP_OBM_stripe(1,"B"), bent, 1, 4096*sizeof(int64_t), 0);
DMA_CPU (OBM2CM, TT_o, MAP_OBM_stripe(1,"F"), tt, 1, 4096*sizeof(int64_t), 0);
wait_DMA (0);











3. makefile 


# $Id: Makefile.template,v 1.13 2005/04/12 19:18:30 jls Exp $
#
# Copyright 2003 SRC Computers, Inc. All Rights Reserved.
#
# Manufactured in the United States of America.
#
# SRC Computers, Inc.
# 4240 N Nevada Avenue
# Colorado Springs, CO 80907
# (v) (719) 262-0213
# (f) (719) 262-0223
#
# No permission has been granted to distribute this software
# without the express permission of SRC Computers, Inc.
#
# This program is distributed WITHOUT ANY WARRANTY OF ANY KIND.
#
# User defines FILES, MAPFILES, and BIN here
#
FILES = main.c
MAPFILES = subr.mc
BIN = main





#
# Multi chip info provided here
# (Leave commented out if not used)
#
#PRIMARY = <primary file 1> <primary file 2>
#SECONDARY = <secondary file 1> <secondary file 2>
#CHIP2 = <file to compile to user chip 2>
#
# User defined directory of code routines
# that are to be inlined
#
INLINEDIR =
#
# User defined macros info supplied here
#
# (Leave commented out if not used)
#
MACROS = my_macro/CircPipe.v
MY_BLKBOX = my_macro/blk.v
MY_NGO_DIR = my_macro
MY_INFO = my_macro/info





#
# Floating point macros selection
#
#FPMODE = SRC_IEEE_V1 # Default SRC version IEEE
#FPMODE = SRC_IEEE_V2 # Size reduced SRC IEEE with
#                     # special rounding mode
#
# User supplied MCC and MFTN flags
#
MCCFLAGS = -v
MFTNFLAGS = -v
#
# User supplied flags for C & Fortran compilers
#
CC = gcc # icc for Intel cc for Gnu
FC = ifort # ifort for Intel f77 for Gnu
#LD = ifort -nofor_main # for mixed C and Fortran, main in C
#LD = ifort # for Fortran or C/Fortran mixed, main in Fortran
LD = gcc # for C codes
MY_CFLAGS =
MY_FFLAGS =
MY_LDFLAGS = # Flags to include libs if needed








#
# VCS simulation settings
# (Set as needed, otherwise just leave commented out)
#
#USEVCS = yes # YES or yes to use vcs instead of vcsi
#VCSDUMP = yes # YES or yes to generate vcd+ trace dump
#
# MODELSIM simulation settings
# (Set as needed, otherwise just leave commented out)
#
#USEMDL = yes # YES or yes to use modelsim instead of vcs/vcsi
#USEMDLGUI = yes # YES or yes to use modelsim GUI interface
#MDLDUMP = yes # YES or yes to generate vcd trace dump
#
# No modifications are required below
#
MAKIN ?= $(MC_ROOT)/opt/srcci/comp/lib/AppRules.make
include $(MAKIN)


4. info

/************************************************************************/
/*                                                                      */
/* info - info file to specify the input and output of macro CircPipeCue */
/*                                                                      */
/* Author: Chris Johnson                                                */
/* Created: August 2, 2010                                              */
/* Last modified: September 3, 2010                                     */
/*                                                                      */
/************************************************************************/

BEGIN_DEF "my_operator"          //Name used in .mc file to call macro.
MACRO = "CircPipe";              //Macro name.
STATEFUL = NO;
EXTERNAL = NO;
PIPELINED = YES;
LATENCY = 0;

INPUTS = 4:
I0 = INT 64 BITS (FNCS0[63:0])
I1 = INT 64 BITS (FNCS1[63:0])
I2 = INT 64 BITS (FNCS2[63:0])
I3 = INT 64 BITS (FNCS3[63:0])
;

OUTPUTS = 4:
O0 = INT 64 BITS (REJECT[63:0])
O1 = INT 64 BITS (BENT[63:0])
O2 = INT 8 BITS (valid_out[7:0])     //only need 1 bit
O3 = INT 64 BITS (STAGE_TT_out[63:0])
;

IN_SIGNAL : 1 BITS "CLK" = "CLOCK";

END_DEF


5. blk.v


/************************************************************************/
/*                                                                      */
/* blk.v - black-box file that specifies input and output               */
/*                                                                      */
/* Author: Chris Johnson                                                */
/* Created: August 1, 2010                                              */
/* Last modified: September 3, 2010                                     */
/*                                                                      */
/************************************************************************/

module CircPipe (CLK, FNCS0, FNCS1, FNCS2, FNCS3, REJECT, BENT, valid_out, STAGE_TT_out);

input CLK;
input [63:0] FNCS0;
input [63:0] FNCS1;
input [63:0] FNCS2;
input [63:0] FNCS3;
output [63:0] REJECT;
output [7:0] valid_out;
output [63:0] STAGE_TT_out;
output [63:0] BENT;

endmodule
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